
[Kernel] Fused MoE Config for Mixtral 8x22 #4002

Merged
1 commit merged into vllm-project:main on Apr 11, 2024

Conversation

@ywang96 (Member) commented on Apr 11, 2024:

This PR adds fused MoE configs for Mixtral 8x22 on A100-80G and H100 with TP4 and TP8, which gives roughly a 10% latency speedup at high batch sizes.
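For readers unfamiliar with these files: the fused MoE configs in vLLM are JSON files under `vllm/model_executor/layers/fused_moe/configs/` (named by expert count, intermediate-size shard, and device name), keyed by token-batch size, with Triton tiling parameters as values; at runtime the kernel uses the entry tuned for the batch size closest to the current one. The sketch below only illustrates that shape and lookup; the parameter values are placeholders, not the tuned entries added by this PR.

```python
# Illustrative only: placeholder values, not the tuned entries from this PR.
import json

# Shape of a fused-MoE config file: {"<num_tokens>": {Triton tile parameters}}
example_config = {
    "8":  {"BLOCK_SIZE_M": 16, "BLOCK_SIZE_N": 64,  "BLOCK_SIZE_K": 64,
           "GROUP_SIZE_M": 1, "num_warps": 4, "num_stages": 4},
    "16": {"BLOCK_SIZE_M": 16, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 64,
           "GROUP_SIZE_M": 1, "num_warps": 4, "num_stages": 4},
    "32": {"BLOCK_SIZE_M": 32, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 64,
           "GROUP_SIZE_M": 4, "num_warps": 4, "num_stages": 4},
}

def pick_config(configs: dict, num_tokens: int) -> dict:
    """Return the entry tuned for the batch size closest to num_tokens,
    mirroring how the fused MoE kernel falls back between tuned sizes."""
    best_key = min(configs, key=lambda m: abs(int(m) - num_tokens))
    return configs[best_key]

if __name__ == "__main__":
    # A batch of 24 tokens would pick up the entry tuned for 16 or 32.
    print(json.dumps(pick_config(example_config, 24), indent=2))
```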

Attaching the mean and median latencies from running benchmark_latency.py on A100-80G as a sanity check (latencies in seconds):

| TP | Batch size | Avg (default) | Median (default) | Avg (with config) | Median (with config) |
|----|-----------|---------------|------------------|-------------------|----------------------|
| 4  | 8  | 5.788 | 5.785 | 5.427 | 5.415 |
| 4  | 16 | 6.901 | 6.911 | 6.108 | 6.106 |
| 4  | 32 | 7.437 | 7.425 | 6.894 | 6.877 |
| 8  | 8  | 3.839 | 3.841 | 3.708 | 3.704 |
| 8  | 16 | 4.429 | 4.422 | 4.047 | 4.040 |
| 8  | 32 | 4.768 | 4.752 | 4.391 | 4.380 |
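As a quick cross-check of the "roughly 10%" claim, the relative latency reduction can be computed directly from the table above; the short script below just does that arithmetic (numbers copied from the table, rounded to milliseconds).

```python
# Latency pairs (default, with config) in seconds, taken from the table above.
results = {
    ("TP4", 8):  (5.788, 5.427),
    ("TP4", 16): (6.901, 6.108),
    ("TP4", 32): (7.437, 6.894),
    ("TP8", 8):  (3.839, 3.708),
    ("TP8", 16): (4.429, 4.047),
    ("TP8", 32): (4.768, 4.391),
}

for (tp, bs), (base, tuned) in results.items():
    reduction = (base - tuned) / base * 100
    print(f"{tp} bs={bs}: {reduction:.1f}% lower latency with the tuned config")
```

By these numbers the reduction is about 3-6% at batch size 8 and about 7-12% at batch sizes 16 and 32, consistent with the roughly 10% figure at high batch sizes.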

@WoosukKwon (Collaborator) left a comment:


@ywang96 LGTM. Thanks for the PR!

@WoosukKwon merged commit c1dc547 into vllm-project:main on Apr 11, 2024
35 checks passed
andy-neuma pushed a commit to neuralmagic/nm-vllm that referenced this pull request Apr 12, 2024
@ywang96 deleted the moe-mixtral8x22 branch on April 13, 2024 at 08:19
z103cb pushed a commit to z103cb/opendatahub_vllm that referenced this pull request Apr 22, 2024
Temirulan pushed a commit to Temirulan/vllm-whisper that referenced this pull request Sep 6, 2024
@Fkawala commented on Nov 12, 2024:

@ywang96 I'm not an expert; could you explain what the process would be to adapt this config to NVIDIA_A100-SXM4-40GB? Thank you!
