
[Feature] DeepSeek-Coder-V2-Instruct-FP8 on 8xA100 #989

Closed
halexan opened this issue Aug 8, 2024 · 9 comments

halexan commented Aug 8, 2024

Motivation

vLLM has announced support for running Llama 3.1 405B (FP8) on 8xA100; see their blog post.

Does sglang support running DeepSeek-Coder-V2-Instruct-FP8 on 8xA100?

Related resources

No response

Ying1123 (Member) commented Aug 8, 2024

Llama 3.1 405B (FP8) is supported in sglang:

sglang/README.md (lines 199 to 200 at commit 228cf47):

## Run 405B (fp8) on a single node
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --tp 8

DeepSeek-Coder-V2-Instruct-FP8 should be supported as well. Could you try it and let us know if there are any problems?
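For example, something along these lines should work (a sketch, assuming an FP8 checkpoint such as neuralmagic/DeepSeek-Coder-V2-Instruct-FP8; --trust-remote-code may be needed for the DeepSeek-V2 tokenizer/config):

## Run DeepSeek-Coder-V2-Instruct (fp8) on a single node (untested sketch)
python -m sglang.launch_server --model-path neuralmagic/DeepSeek-Coder-V2-Instruct-FP8 --tp 8 --trust-remote-code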

Xu-Chen (Contributor) commented Aug 8, 2024

vLLM doesn't support MoE FP8 models on Ampere. This is because vLLM uses Triton for its FusedMoE kernel, which doesn't support the FP8 Marlin mixed-precision GEMM. See https://huggingface.co/neuralmagic/DeepSeek-Coder-V2-Instruct-FP8/discussions/1

Running DeepSeek-Coder-V2-Lite-Instruct-FP8 produces this error:

  File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 327, in load_model
    model.load_weights(
  File "/root/.local/lib/python3.10/site-packages/sglang/srt/models/deepseek_v2.py", line 694, in load_weights
    weight_loader(
  File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 205, in weight_loader
    raise ValueError(
ValueError: input_scales of w1 and w3 of a layer must be equal. But got 0.06986899673938751 vs. 0.09467455744743347
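
For context: the fused MoE kernel quantizes the activation once and feeds the same FP8 tensor into both the gate projection (w1) and the up projection (w3), which is why the loader insists on a single shared input scale. A rough illustration of the constraint (not vLLM's actual code; a common workaround is to re-quantize the checkpoint with shared scales, or to merge them by taking the larger one):

# Conceptual sketch only, not vLLM code: the fused w13 GEMM consumes one
# quantized activation, so w1 and w3 must agree on input_scale.
w1_input_scale = 0.06986899673938751  # values from the error above
w3_input_scale = 0.09467455744743347

# Taking the larger scale is the conservative merge: it avoids clipping
# the activation for either projection.
shared_input_scale = max(w1_input_scale, w3_input_scale)
print(f"shared input_scale for the fused gate/up projection: {shared_input_scale}")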

halexan (Author) commented Aug 10, 2024

What is your vllm version?

Xu-Chen (Contributor) commented Aug 10, 2024

What is your vllm version?

0.5.4

KylinMountain (Contributor) commented:

@Xu-Chen So can we use sglang to run DeepSeek-Coder-V2 (236B)? Thanks

halexan (Author) commented Aug 12, 2024

@Xu-Chen So can we use sglang to run DeepSeek-Coder-V2 (236B)? Thanks

Yes, you can, without quantization.
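
For example, something like this (a sketch, assuming 8x80GB A100s and the official deepseek-ai checkpoint; the BF16 weights of the 236B model are roughly 470 GB, which just fits in 640 GB of HBM):

python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-Coder-V2-Instruct --tp 8 --trust-remote-code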

merrymercy (Contributor) commented:

All of them should be supported in v0.3.1.post3. See also the blog post: https://lmsys.org/blog/2024-09-04-sglang-v0-3/
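
For example, roughly (assuming the neuralmagic FP8 checkpoint discussed above):

pip install --upgrade "sglang[all]>=0.3.1.post3"
python -m sglang.launch_server --model-path neuralmagic/DeepSeek-Coder-V2-Instruct-FP8 --tp 8 --trust-remote-code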

Xu-Chen (Contributor) commented Sep 22, 2024

All of them should be supported in v0.3.1.post3. See also the blog post: https://lmsys.org/blog/2024-09-04-sglang-v0-3/

Have you tested this on an A100?

merrymercy (Contributor) commented:

A100 does not support FP8 natively, so I guess it is not supported.
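
A quick way to check: native FP8 tensor cores first appear at compute capability 8.9 (Ada) and 9.0 (Hopper), while A100 is 8.0. A small sketch:

import torch

# A100 reports (8, 0); native FP8 tensor cores start at (8, 9) (Ada) / (9, 0) (Hopper).
major, minor = torch.cuda.get_device_capability()
print(f"compute capability {major}.{minor}, native FP8 tensor cores: {(major, minor) >= (8, 9)}")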
