Adding Support for Custom Qwen2moe Architectures with mergekit-qwen2 #6453

Draft
DisOOM wants to merge 3 commits into master

Conversation

@DisOOM commented Apr 3, 2024

Statement: This has nothing to do with the fine-grained MoE architecture in Qwen/Qwen1.5-MoE-A2.7B. It is more akin to a traditional MoE, except that its experts are derived from the qwen2 (qwen1.5) model.

I was previously using mergekit-moe to merge qwen1.5 models into an MoE, but the resulting models were corrupted after being converted to the GGUF format.
Subsequently, I discovered this custom mergekit script that successfully merges qwen2 (qwen1.5) models into an MoE: https://github.com/Aratako/mergekit-qwen2. Following the example of #4912, I made some modifications to llama.cpp, enabling it to correctly convert, quantize, and run MoEs merged with this custom script.
This worked well on older versions, but I encounter errors with the latest version: the model converts and quantizes correctly but fails to run. I believe the issue lies in an incompatibility with the changes made to llama.cpp in #6122, but I am unsure how to resolve it.

I am a newbie to coding and this is my first PR, please be lenient.

I encountered no issues when converting with convert-hf-to-gguf.py and quantizing with quantize.exe, but I hit the following error when running main.exe.

PS D:\llama.cpp\llama.cpp> ./build/bin/Release/main.exe -m D:/model/ggml-model-f16.gguf -n 128
Log start
main: build = 2585 (f87f7b89)
main: built with MSVC 19.39.33523.0 for x64
main: seed  = 1712122664
llama_model_loader: loaded meta data with 21 key-value pairs and 643 tensors from D:/model-merge/Merged/ggml-model-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.name str              = Merged
llama_model_loader: - kv   2:                          qwen2.block_count u32              = 40
llama_model_loader: - kv   3:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv   4:                     qwen2.embedding_length u32              = 5120
llama_model_loader: - kv   5:                  qwen2.feed_forward_length u32              = 13696
llama_model_loader: - kv   6:                 qwen2.attention.head_count u32              = 40
llama_model_loader: - kv   7:              qwen2.attention.head_count_kv u32              = 40
llama_model_loader: - kv   8:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv   9:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  10:                         qwen2.expert_count u32              = 2
llama_model_loader: - kv  11:                    qwen2.expert_used_count u32              = 2
llama_model_loader: - kv  12:                          general.file_type u32              = 1
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  16:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  18:            tokenizer.ggml.padding_token_id u32              = 151645
llama_model_loader: - kv  19:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% for message in messages %}{% if lo...
llama_model_loader: - type  f32:  201 tensors
llama_model_loader: - type  f16:  442 tensors
llm_load_vocab: special tokens definition check successful ( 421/152064 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 152064
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 5120
llm_load_print_meta: n_head           = 40
llm_load_print_meta: n_head_kv        = 40
llm_load_print_meta: n_layer          = 40
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 5120
llm_load_print_meta: n_embd_v_gqa     = 5120
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 13696
llm_load_print_meta: n_expert         = 2
llm_load_print_meta: n_expert_used    = 2
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 13B
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 22.58 B
llm_load_print_meta: model size       = 42.07 GiB (16.00 BPW)
llm_load_print_meta: general.name     = Merged
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151645 '<|im_end|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_tensors: ggml ctx size =    0.25 MiB
llm_load_tensors:        CPU buffer size = 43074.71 MiB
................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   400.00 MiB
llama_new_context_with_model: KV self size  =  400.00 MiB, K (f16):  200.00 MiB, V (f16):  200.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.58 MiB
llama_new_context_with_model:        CPU compute buffer size =   353.76 MiB
llama_new_context_with_model: graph nodes  = 2164
llama_new_context_with_model: graph splits = 1
GGML_ASSERT: D:\llama.cpp\llama.cpp:9701: lctx.inp_out_ids && "every model that can must skip unused outputs"
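
If I read the assert correctly, it comes from the output-skipping change in #6122: graph builders that can skip unused outputs are now expected to create lctx.inp_out_ids and gather only the rows that will actually be returned. The existing builders in llm_build_context do this at the top of the last layer; below is a minimal sketch of that pattern, copied in spirit from those builders. Exactly where it slots into the custom qwen2 MoE graph is an assumption on my part, not a tested fix.

    // inside the per-layer loop of the graph builder (llm_build_context)
    if (il == n_layer - 1) {
        // skip computing output for unused tokens: gather only the rows
        // that will be returned, which also creates lctx.inp_out_ids and
        // satisfies the GGML_ASSERT above
        struct ggml_tensor * inp_out_ids = build_inp_out_ids();
        cur   = ggml_get_rows(ctx0, cur,   inp_out_ids);
        inpSA = ggml_get_rows(ctx0, inpSA, inp_out_ids);
    }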

@DisOOM changed the title from "Adding Support for Custom Qwen2moe Architectures Using mergekit-qwen2" to "Adding Support for Custom Qwen2moe Architectures with mergekit-qwen2" on Apr 3, 2024
github-actions bot commented Apr 3, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3: 503 iterations 🚀

  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=9295.2ms p(90)=26525.97ms fails=0, finish reason: stop=503 truncated=0
  • Prompt processing (pp): avg=241.95tk/s p(90)=732.6tk/s total=200.08tk/s
  • Token generation (tg): avg=98.97tk/s p(90)=277.24tk/s total=130.21tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=master commit=115f49a08a1c9fd59c60ed1425827d9ae2614565
Time series charts (Mermaid chart data omitted): prompt_tokens_seconds, predicted_tokens_seconds, kv_cache_usage_ratio, requests_processing

@maziyarpanahi

Thanks @DisOOM for creating this PR based on our discussion regarding why MoE models based on Qwen don't work properly.

I will tag @compilade and @slaren, who were involved in the PR you mentioned. However, have you tried using that PR to see if MoE models based on the Qwen architecture work properly? #6387

I am testing #6387 now for DBRX, but if it is meant to solve issues with MoE in general (I am not sure whether there is a difference between a mergekit MoE and others like Qwen, Mixtral, or DBRX), I would personally try it to see whether my quantized Qwen MoE model works.

@ggerganov (Owner)

Qwen MoE models should be able to work after merging #6387 and then #6074

DBRX models likely also depend on #6387 + we need conversion scripts and compute graph implementation

@DisOOM (Author) commented Apr 3, 2024

Thanks @DisOOM for creating this PR based on our discussion regarding why MoE models based on Qwen don't work properly.

I will tag @compilade and @slaren, who were involved in the PR you mentioned. However, have you tried using that PR to see if MoE models based on the Qwen architecture work properly? #6387

I am testing #6387 now for DBRX, but if it is meant to solve issues with MoE in general (I am not sure whether there is a difference between a mergekit MoE and others like Qwen, Mixtral, or DBRX), I would personally try it to see whether my quantized Qwen MoE model works.

I haven't tried this PR yet. I will give it a try later.

@maziyarpanahi

I have pulled and used the latest changes from the master branch. I have successfully converted this model into fp16 GGUF: https://huggingface.co/MaziyarPanahi/Qwen1.5-8x7b-v0.1

It works fine and produces coherent output. However, any model quantized from this fp16 results in the following error:

..................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   256.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.58 MiB
llama_new_context_with_model:        CPU compute buffer size =   343.26 MiB
llama_new_context_with_model: graph nodes  = 1638
llama_new_context_with_model: graph splits = 1
GGML_ASSERT: ggml.c:11015: wdata == wdata_src1_end
Aborted (core dumped)

@ggerganov I am not sure what causes this error. This is an MoE made by MergeKit based on Qwen models (one of those situations where the fp16 GGUF model works fine, but the quantized one either crashes or outputs nonsense).

@mofosyne added the labels "Review Complexity : High" (generally requires in-depth knowledge of LLMs or GPUs) and "model" (model specific) on May 10, 2024