
[Feature]: Support Mixtral-8x22B-v0.1 #3983

Closed
yh-yao opened this issue Apr 10, 2024 · 4 comments

Comments

@yh-yao

yh-yao commented Apr 10, 2024

🚀 The feature, motivation and pitch

Do we support running Mixtral-8x22B-v0.1 now? It takes very long for compiling on my side.
https://huggingface.co/mistral-community/Mixtral-8x22B-v0.1

Alternatives

No response

Additional context

No response

@simon-mo
Collaborator

vLLM should be able to support it as-is, without modification. I have heard reports of people successfully running it. Out-of-the-box performance should be okay, but there is room for improvement in tuning the MoE kernels. cc @pcmoritz @richardliaw if your team has bandwidth to help tune.
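For anyone trying it, something along these lines should work out of the box with the standard vLLM Python API. This is only a sketch: the tensor_parallel_size value is an example and needs to match your own GPU setup, since the model is too large for a single device.

```python
from vllm import LLM, SamplingParams

# Mixtral-8x22B is large, so it has to be sharded across GPUs.
# tensor_parallel_size=8 is only an example; set it to however many
# GPUs you actually have with enough total memory for the weights.
llm = LLM(
    model="mistral-community/Mixtral-8x22B-v0.1",
    tensor_parallel_size=8,
)

outputs = llm.generate(
    ["The capital of France is"],
    SamplingParams(temperature=0.0, max_tokens=32),
)
print(outputs[0].outputs[0].text)
```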

@ywang96
Member

ywang96 commented Apr 11, 2024

Can confirm https://huggingface.co/mistral-community/Mixtral-8x22B-v0.1 already works on latest PyPI release.

I've created #4002 to add the configs for the moe kernels - feel free to test them out.

@yh-yao
Author

yh-yao commented Apr 13, 2024

It is working now. vLLM works very well, although it requires more GPU memory than usual. Thank you very much for the help!

@yh-yao yh-yao closed this as completed Apr 13, 2024
@chenliverantos

Hi guys,

For the Mixtral 8x models, do you know whether the vLLM implementation supports the full sparse matmul optimization designed for MoE models, as specified in the original paper:
https://arxiv.org/pdf/2401.04088
More specifically, does it use the recommended MEGABLOCKS sparse matrix computation described here:
https://arxiv.org/pdf/2211.15841
in the form of the Blocked Compressed Sparse Row method?
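To be concrete, here is a rough sketch (plain PyTorch, hypothetical names and shapes) of the per-expert computation we are asking about. MEGABLOCKS expresses this gather/GEMM/scatter as a single blocked-CSR sparse matmul with no padding; this sketch is only an illustration of the concept, not vLLM's fused MoE kernel.

```python
import torch

# Toy sizes, purely illustrative.
num_tokens, hidden, ffn, num_experts, top_k = 8, 16, 32, 4, 2

x = torch.randn(num_tokens, hidden)            # token activations
w1 = torch.randn(num_experts, hidden, ffn)     # one weight matrix per expert
router_logits = torch.randn(num_tokens, num_experts)
topk_weights, topk_ids = torch.topk(router_logits.softmax(dim=-1), top_k)

out = torch.zeros(num_tokens, ffn)
for e in range(num_experts):
    # Gather the (token, slot) pairs routed to expert e. MEGABLOCKS fuses this
    # gather, the per-expert GEMM, and the scatter into one block-sparse matmul;
    # here it is an explicit loop for clarity.
    token_ids, slot = torch.where(topk_ids == e)
    if token_ids.numel() == 0:
        continue
    expert_out = x[token_ids] @ w1[e]          # dense GEMM for this expert's tokens
    out.index_add_(0, token_ids,
                   expert_out * topk_weights[token_ids, slot].unsqueeze(1))
```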

Does the vLLM implementation already have everything needed for this sparse matmul optimization, would it require additional infrastructure work to enable, or is it not possible to use it in the exact manner that the MEGABLOCKS parallel optimization specifies?

We have not been able to find sufficiently specific documentation anywhere in vLLM to make this clear. We would appreciate any information you can provide.

Best
