[Kernel] Fused MoE Config for Mixtral 8x22 #4002

ywang96 · 2024-04-11T09:47:28Z

This PR adds fused MoE configs for Mixtral 8x22 on A100-80G & H100 with TP4 & 8 - Roughly 10% latency speedup for high batch sizes.

Attaching mean and median results from running benchmark_latency.py on A100-80G as a sanity check

TP4
bs 8
Avg latency: 5.787593524421876 seconds
50% percentile latency: 5.785140061285347 seconds

bs 16
Avg latency: 6.900521802188208 seconds
50% percentile latency: 6.910593919456005 seconds

bs 32
Avg latency: 7.437153746870657 seconds
50% percentile latency: 7.425246893661097 seconds

TP4 with config
bs 8
Avg latency: 5.4268179705676935 seconds
50% percentile latency: 5.4153253806289285 seconds

bs 16
Avg latency: 6.107941208345195 seconds
50% percentile latency: 6.10570535203442 seconds

bs 32
Avg latency: 6.893939550562451 seconds
50% percentile latency: 6.877172949025407 seconds

TP8
bs 8
Avg latency: 3.83942656442523 seconds
50% percentile latency: 3.8410149838309735 seconds

bs 16
Avg latency: 4.428924084718649 seconds
50% percentile latency: 4.421724954620004 seconds

bs 32
Avg latency: 4.767531445249915 seconds
50% percentile latency: 4.752460908377543 seconds


TP8 with config
bs 8
Avg latency: 3.7083894722008455 seconds
50% percentile latency: 3.7043474409729242 seconds

bs 16
Avg latency: 4.047313968433688 seconds
50% percentile latency: 4.0400408857967705 seconds

bs 32
Avg latency: 4.390694409639885 seconds
50% percentile latency: 4.380223051644862 seconds

WoosukKwon

@ywang96 LGTM. Thanks for the PR!

Fkawala · 2024-11-12T18:05:28Z

@ywang96 I'm not an expert could you explain how what would be the process to adapt this config to NVIDIA_A100-SXM4-40GB ? Thank you !

config for mixtral 8x22

4171f90

ywang96 mentioned this pull request Apr 11, 2024

[Feature]: Support Mixtral-8x22B-v0.1 #3983

Closed

WoosukKwon approved these changes Apr 11, 2024

View reviewed changes

WoosukKwon merged commit c1dc547 into vllm-project:main Apr 11, 2024
35 checks passed

andy-neuma pushed a commit to neuralmagic/nm-vllm that referenced this pull request Apr 12, 2024

[Kernel] Fused MoE Config for Mixtral 8x22 (vllm-project#4002)

1644532

ywang96 deleted the moe-mixtral8x22 branch April 13, 2024 08:19

z103cb pushed a commit to z103cb/opendatahub_vllm that referenced this pull request Apr 22, 2024

[Kernel] Fused MoE Config for Mixtral 8x22 (vllm-project#4002)

61be64a

dtrifiro mentioned this pull request May 15, 2024

bump ubi base image tag opendatahub-io/vllm#24

Merged

Temirulan pushed a commit to Temirulan/vllm-whisper that referenced this pull request Sep 6, 2024

[Kernel] Fused MoE Config for Mixtral 8x22 (vllm-project#4002)

9cc1dea

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Kernel] Fused MoE Config for Mixtral 8x22 #4002

[Kernel] Fused MoE Config for Mixtral 8x22 #4002

ywang96 commented Apr 11, 2024 •

edited

Loading

WoosukKwon left a comment

Fkawala commented Nov 12, 2024

[Kernel] Fused MoE Config for Mixtral 8x22 #4002

[Kernel] Fused MoE Config for Mixtral 8x22 #4002

Conversation

ywang96 commented Apr 11, 2024 • edited Loading

WoosukKwon left a comment

Choose a reason for hiding this comment

Fkawala commented Nov 12, 2024

ywang96 commented Apr 11, 2024 •

edited

Loading