[Feature]: DeepSeek-Coder-V2-Instruct-FP8 on 8xA100 #7322
Comments
There is not currently a workaround for this. We have been working on extending Marlin to cover fused MoE; see #7079 for progress on the Marlin fused_moe kernel.
Closing for now.
Hello @robertgshaw2-neuralmagic, may I ask why an FP8-quantized model would use an FP16xINT4 matmul kernel? Could you point me to any resources or blog posts about this? Thank you.
Marlin is a mixed-precision inference kernel. It supports int4 weights, int8 weights, and fp8 weights with 16-bit activations (for dense models). We started by extending Marlin to support fused MoE with int4 and int8 weights and fp16 activations (the PR I linked). A follow-up to this will extend it to support fp8 weights as well.
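A minimal sketch of what using the Marlin path looks like from the user side, assuming vLLM's `gptq_marlin` quantization mode; the model name is a placeholder, not a specific checkpoint:

```python
# Hypothetical sketch: load a weight-only (int4 GPTQ) checkpoint so the dense
# linear layers run through the Marlin mixed-precision GEMM
# (int4 weights, 16-bit activations). Model name is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(
    model="org/some-model-GPTQ",   # placeholder: an int4 GPTQ checkpoint
    quantization="gptq_marlin",    # request the Marlin kernel explicitly
    tensor_parallel_size=1,
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=16)))
```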
At what batch size does Marlin become optimal (i.e., hit the roofline) for FP8?
I’m not sure I follow the question. The roofline analysis shows the latency of the kernel as a function of batch size. Marlin GEMM is a highly optimized kernel that was designed to address performance issues with the prior generation of mixed-precision kernels, which did not perform well in the batch 8-64 range even though the computation is memory bound. So Marlin follows the roofline plot very well. But you should not expect Marlin to accelerate compute-bound workloads over fp16. For compute-bound workloads we recommend using activation quantization.
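To make the memory-bound vs. compute-bound point concrete, here is a back-of-the-envelope roofline sketch (not vLLM code); the hardware numbers are rough A100-like assumptions, not measurements:

```python
# Estimate whether an M x K x N GEMM with 4-bit weights and fp16 activations
# is memory- or compute-bound as the batch size M grows.
def gemm_time_estimate(m, k, n, w_bits=4, peak_flops=312e12, mem_bw=2.0e12):
    bytes_moved = k * n * w_bits / 8 + 2 * (m * k + m * n)  # weights + fp16 act/out
    flops = 2 * m * k * n
    t_mem = bytes_moved / mem_bw      # time if limited by HBM bandwidth
    t_compute = flops / peak_flops    # time if limited by fp16 tensor cores
    regime = "memory-bound" if t_mem > t_compute else "compute-bound"
    return max(t_mem, t_compute), regime

for m in (1, 8, 64, 256, 1024):
    t, regime = gemm_time_estimate(m, k=8192, n=8192)
    print(f"batch {m:5d}: ~{t * 1e6:7.1f} us, {regime}")
```

At small batch sizes the weight traffic dominates (memory-bound), which is exactly where a weight-only kernel like Marlin helps; once the workload becomes compute-bound, weight-only quantization cannot beat fp16.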
One follow-up: if you’re running on Hopper, I don’t think it makes sense to use Marlin for fp8, since we can use dynamic activation quantization with high accuracy. The only use of Marlin fp8, IMO, should be for devices which do not support fp8 compute (i.e., A100).
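An illustrative sketch of what "dynamic activation quantization" means here (this is not vLLM internals, just the idea): the activation scale is computed on the fly from the live tensor and the activations are cast to fp8, so both GEMM operands are 8-bit. On hardware without fp8 tensor cores (e.g. A100) this step is unavailable, which is why the fp8 weights have to be dequantized to fp16 instead (the Marlin fp8 path).

```python
import torch

FP8_MAX = 448.0  # max representable value of float8_e4m3fn

def dynamic_fp8_quantize(x: torch.Tensor):
    scale = x.abs().max() / FP8_MAX               # per-tensor scale from live data
    x_fp8 = (x / scale).to(torch.float8_e4m3fn)   # cast activations to fp8
    return x_fp8, scale

x = torch.randn(16, 4096)
x_fp8, scale = dynamic_fp8_quantize(x)
x_back = x_fp8.to(torch.float32) * scale          # dequantize to check the round trip
print((x - x_back).abs().max())
```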
I see, thank you for the detailed response!
🚀 The feature, motivation and pitch
vLLM has announced support for running Llama-3.1-405B-FP8 on 8xA100; this is described in the blog post.
Does vLLM support running DeepSeek-Coder-V2-Instruct-FP8 on 8xA100?
However, I notice that vLLM uses Triton for its fused MoE kernel, which doesn't support the FP8 Marlin mixed-precision GEMM. See sgl-project/sglang#989 (comment)
Is there any workaround?
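For reference, a minimal sketch of what this request would look like, assuming a hypothetical fp8 checkpoint name ("org/DeepSeek-Coder-V2-Instruct-FP8" is a placeholder) and tensor parallelism over the 8 A100s; whether it actually runs depends on the fused-MoE FP8 Marlin support discussed in the comments above:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="org/DeepSeek-Coder-V2-Instruct-FP8",  # placeholder repo name
    tensor_parallel_size=8,                      # shard across 8x A100
    trust_remote_code=True,                      # DeepSeek model code
)
out = llm.generate(["Write a quicksort in Python."], SamplingParams(max_tokens=128))
print(out[0].outputs[0].text)
```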
Alternatives
No response
Additional context
No response