[Feature]: DeepSeek-Coder-V2-Instruct-FP8 on 8xA100 #7322
Comments
There is not currently a workaround for this. We have been working on extending Marlin to cover fused MoE; see #7079 for progress on the Marlin fused_moe kernel.
Closing for now.
Hello @robertgshaw2-neuralmagic, may I ask why an FP8-quantized model would use an FP16xINT4 matmul kernel? Could you point me to any resources or blog posts about this? Thank you.
Marlin is a mixed-precision inference kernel. It supports int4 weights, int8 weights, and fp8 weights with 16-bit activations (for dense models). We started by extending Marlin to support fused MoE with int4 and int8 weights and fp16 activations (the PR I linked). A follow-up to this will extend it to support fp8 weights as well.
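A minimal sketch of what using the Marlin path looks like from the user side, assuming vLLM's `gptq_marlin` quantization mode; the model name is a placeholder, not a specific checkpoint:

```python
# Hypothetical sketch: load a weight-only (int4 GPTQ) checkpoint so the dense
# linear layers run through the Marlin mixed-precision GEMM
# (int4 weights, 16-bit activations). Model name is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(
    model="org/some-model-GPTQ",   # placeholder: an int4 GPTQ checkpoint
    quantization="gptq_marlin",    # request the Marlin kernel explicitly
    tensor_parallel_size=1,
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=16)))
```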
At what batch size does Marlin become optimal (i.e., hit the roofline) for FP8?
I’m not sure I follow the question. The roofline analysis shows the latency of the kernel as a function of batch size. Marlin GEMM is a highly optimized kernel that was designed to address performance issues with the prior generation of mixed-precision kernels, which did not perform well in the batch 8-64 range even though the computation is memory bound. So Marlin follows the roofline plot very well. But you should not expect Marlin to accelerate compute-bound workloads over fp16. For compute-bound workloads we recommend using activation quantization.
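To make the memory-bound vs. compute-bound point concrete, here is a back-of-the-envelope roofline sketch (not vLLM code); the hardware numbers are rough A100-like assumptions, not measurements:

```python
# Estimate whether an M x K x N GEMM with 4-bit weights and fp16 activations
# is memory- or compute-bound as the batch size M grows.
def gemm_time_estimate(m, k, n, w_bits=4, peak_flops=312e12, mem_bw=2.0e12):
    bytes_moved = k * n * w_bits / 8 + 2 * (m * k + m * n)  # weights + fp16 act/out
    flops = 2 * m * k * n
    t_mem = bytes_moved / mem_bw      # time if limited by HBM bandwidth
    t_compute = flops / peak_flops    # time if limited by fp16 tensor cores
    regime = "memory-bound" if t_mem > t_compute else "compute-bound"
    return max(t_mem, t_compute), regime

for m in (1, 8, 64, 256, 1024):
    t, regime = gemm_time_estimate(m, k=8192, n=8192)
    print(f"batch {m:5d}: ~{t * 1e6:7.1f} us, {regime}")
```

At small batch sizes the weight traffic dominates (memory-bound), which is exactly where a weight-only kernel like Marlin helps; once the workload becomes compute-bound, weight-only quantization cannot beat fp16.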
One follow-up: if you’re running on Hopper, I don’t think it makes sense to use Marlin for fp8, since we can use dynamic activation quantization with high accuracy. The only use of Marlin fp8, IMO, should be for devices which do not support fp8 compute (i.e., A100).
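An illustrative sketch of what "dynamic activation quantization" means here (this is not vLLM internals, just the idea): the activation scale is computed on the fly from the live tensor and the activations are cast to fp8, so both GEMM operands are 8-bit. On hardware without fp8 tensor cores (e.g. A100) this step is unavailable, which is why the fp8 weights have to be dequantized to fp16 instead (the Marlin fp8 path).

```python
import torch

FP8_MAX = 448.0  # max representable value of float8_e4m3fn

def dynamic_fp8_quantize(x: torch.Tensor):
    scale = x.abs().max() / FP8_MAX               # per-tensor scale from live data
    x_fp8 = (x / scale).to(torch.float8_e4m3fn)   # cast activations to fp8
    return x_fp8, scale

x = torch.randn(16, 4096)
x_fp8, scale = dynamic_fp8_quantize(x)
x_back = x_fp8.to(torch.float32) * scale          # dequantize to check the round trip
print((x - x_back).abs().max())
```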
I see, thank you for the detailed response!
🚀 The feature, motivation and pitch
vLLM has announced support for running Llama-3.1-405B-FP8 on 8xA100; this is described in the blog post.
Does vLLM support running DeepSeek-Coder-V2-Instruct-FP8 on 8xA100?
However, I notice that vLLM uses Triton for its fused MoE kernel, which doesn't support the FP8 Marlin mixed-precision GEMM. See sgl-project/sglang#989 (comment)
Is there any workaround?
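For reference, a minimal sketch of what this request would look like, assuming a hypothetical fp8 checkpoint name ("org/DeepSeek-Coder-V2-Instruct-FP8" is a placeholder) and tensor parallelism over the 8 A100s; whether it actually runs depends on the fused-MoE FP8 Marlin support discussed in the comments above:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="org/DeepSeek-Coder-V2-Instruct-FP8",  # placeholder repo name
    tensor_parallel_size=8,                      # shard across 8x A100
    trust_remote_code=True,                      # DeepSeek model code
)
out = llm.generate(["Write a quicksort in Python."], SamplingParams(max_tokens=128))
print(out[0].outputs[0].text)
```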
Alternatives
No response
Additional context
No response