
Fused MOE for Mixtral #2542

Merged (23 commits into vllm-project:main on Jan 30, 2024)

Conversation

@pcmoritz (Collaborator) commented Jan 22, 2024

This builds on #2453 and #2293 to fuse the MOE kernel for the Mixtral model.

It seems to give a significant performance improvement (in my setup, from 28600 to 33600 tok/s with 1000 input tokens and 50 output tokens on an H100).

Latency with `python benchmarks/benchmark_latency.py --model=mistralai/Mixtral-8x7B-Instruct-v0.1 --input-len 1000 --output-len 50 -tp 8 --num-iters 100 --batch-size <bs>`:

This PR:

bs=1: 0.459s
bs=2: 0.515s
bs=4: 0.610s
bs=8: 0.813s
bs=16: 1.044s
bs=32: 1.489s
bs=64: 2.419s

Compare to master:

bs=1: 0.590s
bs=2: 0.631s
bs=4: 0.709s
bs=8: 0.838s
bs=16: 1.086s
bs=32: 1.615s
bs=64: 2.727s

@pcmoritz changed the title from "[WIP] Fused MOE for Mixtral" to "Fused MOE for Mixtral" on Jan 22, 2024
@pcmoritz (Collaborator, Author)

MMLU evaluation on this PR looks good as well:

|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7038|±  |0.1407|
| - humanities     |N/A    |none  |     5|acc   |0.6459|±  |0.1578|
| - other          |N/A    |none  |     5|acc   |0.7763|±  |0.1104|
| - social_sciences|N/A    |none  |     5|acc   |0.8109|±  |0.0704|
| - stem           |N/A    |none  |     5|acc   |0.6143|±  |0.1402|

@pcmoritz (Collaborator, Author)

Latency numbers:

`python benchmarks/benchmark_latency.py --model=mistralai/Mixtral-8x7B-Instruct-v0.1 --input-len 1000 --output-len 50 -tp 8 --num-iters 100 --batch-size <bs>`

This PR:

bs=1: 0.561s
bs=2: 0.600s
bs=4: 0.725s
bs=8: 0.912s
bs=16: 1.131s
bs=32: 1.547s
bs=64: 2.43s

Master:

bs=1: 0.590s
bs=2: 0.631s
bs=4: 0.709s
bs=8: 0.838s
bs=16: 1.086s
bs=32: 1.615s
bs=64: 2.727s

So there are very nice improvements in both throughput and latency (except at some medium batch sizes, which can maybe be optimized further by tuning the block sizes better).

@pcmoritz (Collaborator, Author)

With the latest version of the fused MOE kernel, the fused kernel now strictly dominates the current master (same settings as above):

bs=1: 0.459s
bs=2: 0.515s
bs=4: 0.610s
bs=8: 0.813s
bs=16: 1.044s
bs=32: 1.489s
bs=64: 2.419s

@WoosukKwon self-requested a review on January 26, 2024 07:30
@pcmoritz (Collaborator, Author)

@WoosukKwon It probably makes sense to review/merge #2453 first since the fused_moe kernel is from there :)

@casper-hansen (Contributor) commented Jan 28, 2024

@pcmoritz I tried importing your code from here and found a maximum absolute difference of 0.3545 in the logits between the normal Mixtral MoE and the fused one.

That seems like a large difference. Could you add a test comparing the normal MixtralSparseMoeBlock from transformers with the fused module?
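Roughly, I imagine something along these lines (just a sketch with hypothetical handles hf_moe and fused_moe; both are assumed to be built from the same checkpoint weights and to live on the same device in the same dtype):

import torch

def max_abs_logit_diff(hf_moe, fused_moe, hidden_size=4096, seq_len=32):
    # hf_moe: transformers' MixtralSparseMoeBlock; fused_moe: the fused module
    # from this PR. Both are assumed to hold identical weights.
    torch.manual_seed(0)
    x = torch.randn(1, seq_len, hidden_size, dtype=torch.bfloat16, device="cuda")
    with torch.no_grad():
        hf_out = hf_moe(x)
        fused_out = fused_moe(x)
    # MixtralSparseMoeBlock returns (hidden_states, router_logits).
    if isinstance(hf_out, tuple):
        hf_out = hf_out[0]
    if isinstance(fused_out, tuple):
        fused_out = fused_out[0]
    # The fused module may return a flattened (num_tokens, hidden) tensor.
    fused_out = fused_out.view(hf_out.shape)
    return (hf_out - fused_out).abs().max().item()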

@pcmoritz (Collaborator, Author) commented Jan 28, 2024

Thanks @casper-hansen, let me look into this some more and compare the numerics with HuggingFace. Here is what I've figured out so far:

import torch
from transformers import AutoModelForCausalLM

from vllm.model_executor.layers.moe import MoE

from vllm.model_executor.models.mixtral import MixtralModel
from vllm.model_executor.models.mixtral import MixtralForCausalLM


# Reference path: the first layer's MoE block from the HuggingFace model.
model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
config = model.config

mixtral_moe = model.model.layers[0].block_sparse_moe

# Run a random hidden state through the reference block.
hidden_states = torch.randn((1, 1, 4096))
output = mixtral_moe.forward(hidden_states)

vLLM:

First initialize model parallelism (this is needed because the model looks up the tensor parallelism configuration, which requires this to be initialized -- maybe going forward we can make the models run independently of that; it would be useful e.g. for unit tests):

from vllm.model_executor.parallel_utils.parallel_state import initialize_model_parallel

torch.distributed.init_process_group(
    backend="nccl",
    world_size=1,
    rank=0,
    init_method=f"file:///tmp/test",
)

initialize_model_parallel()
vllm_moe = MoE(
    config.num_local_experts,
    config.num_experts_per_tok,
    config.hidden_size,
    config.intermediate_size,
    params_dtype=torch.bfloat16
)

# Load weights:

from vllm.model_executor.weight_utils import hf_model_weights_iterator

expert_params_mapping = [
    ("ws" if weight_name in ["w1", "w3"] else "w2s",
    f"experts.{expert_id}.{weight_name}.weight", expert_id)
    for expert_id in range(config.num_local_experts)
    for weight_name in ["w1", "w2", "w3"]
]

params_dict = dict(vllm_moe.named_parameters())
for name, loaded_weight in hf_model_weights_iterator("mistralai/Mixtral-8x7B-v0.1"):
    if name == "model.layers.0.block_sparse_moe.gate.weight":
        params_dict["gate.weight"][:,:] = loaded_weight
    if name.startswith("model.layers.0.block_sparse_moe.experts"):
        for param_name, weight_name, expert_id in expert_params_mapping:
            if weight_name in name:
                param = params_dict[param_name]
                weight_loader = param.weight_loader
                weight_loader(param, loaded_weight, weight_name, expert_id=expert_id)

vllm_moe.forward(hidden_states.bfloat16().to("cuda"))
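The two outputs can then be compared directly, e.g. (a sketch appended to the snippet above; the fused module may return a flattened (num_tokens, hidden) tensor, hence the reshape):

# Compare the HuggingFace reference output with the fused vLLM output.
vllm_out = vllm_moe.forward(hidden_states.bfloat16().to("cuda"))
# The HF block returns (hidden_states, router_logits).
hf_out = output[0] if isinstance(output, tuple) else output
vllm_out = vllm_out.view(hf_out.shape)
print("max abs diff:", (hf_out.bfloat16().to("cuda") - vllm_out).abs().max().item())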

@casper-hansen (Contributor)

See the test I created below for reference. I'm not sure what causes the difference, but it seems like a large one.

https://github.com/casper-hansen/AutoAWQ/blob/mixtral_fused/tests/test_fused_moe.py

Collaborator:

@pcmoritz Should we move the MoE class back to the Mixtral model file? It seems like this MoE layer is not shared between Mixtral and DeepSeek.

Collaborator Author:

Sounds good to me! Feel free to make any edits to the PR you'd like to make or let me know if I should make them :)

Collaborator:

I'd appreciate it if you could do it!

@pcmoritz (Collaborator, Author)

@casper-hansen I don't know if you followed the discussion -- we looked into the numerical differences (#2453 (comment)) and they are due to the TF32 tensor cores being used, so the difference is expected :)
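For anyone curious how much TF32 alone can move matmul results, here is a small standalone illustration (plain PyTorch matmuls at Mixtral-like shapes, not the PR's kernels; note that these torch.backends flags only affect PyTorch ops, not custom Triton kernels):

import torch

a = torch.randn(4096, 14336, device="cuda")
b = torch.randn(14336, 4096, device="cuda")

# Full-precision FP32 matmul.
torch.backends.cuda.matmul.allow_tf32 = False
ref = a @ b

# TF32 tensor-core matmul (reduced mantissa precision).
torch.backends.cuda.matmul.allow_tf32 = True
tf32 = a @ b

print("max abs diff FP32 vs TF32:", (ref - tf32).abs().max().item())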

@WoosukKwon (Collaborator) left a comment

@pcmoritz LGTM! Thanks for the great work!

@WoosukKwon merged commit ab40644 into vllm-project:main on Jan 30, 2024
15 of 17 checks passed
@WoosukKwon (Collaborator)

BTW, I added @scv119 (who wrote #2293) as a co-author of the PR. Thanks @scv119 for the original PR!

@pcmoritz (Collaborator, Author)

Just to be sure, I re-ran MMLU on the latest version of this PR and the result looks good:

|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7052|±  |0.1374|
| - humanities     |N/A    |none  |     5|acc   |0.6491|±  |0.1522|
| - other          |N/A    |none  |     5|acc   |0.7744|±  |0.1099|
| - social_sciences|N/A    |none  |     5|acc   |0.8109|±  |0.0690|
| - stem           |N/A    |none  |     5|acc   |0.6178|±  |0.1384|

@WoosukKwon (Collaborator)

Oh BTW, this PR will break the quantization support for Mixtral. 🤦

@zhuohan123 mentioned this pull request on Jan 31, 2024
NikolaBorisov pushed a commit to deepinfra/vllm that referenced this pull request Jan 31, 2024
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024
alexm-neuralmagic pushed a commit to neuralmagic/nm-vllm that referenced this pull request Feb 13, 2024
@skt7 (Contributor) commented Feb 14, 2024

> benchmarks

@pcmoritz are you using any specific implementation to run the MMLU benchmark (and others) on LLMs served through vLLM? It would be great if you could share the details.
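(The tables above look like lm-evaluation-harness output; if that is what was used, a command along these lines with its vLLM backend might reproduce it, though this is a guess on my part:)

lm_eval --model vllm \
    --model_args pretrained=mistralai/Mixtral-8x7B-Instruct-v0.1,tensor_parallel_size=8 \
    --tasks mmlu --num_fewshot 5 --batch_size auto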
