[Performance]: FLASHINFER backend is slower than FLASH_ATTN on H100 #9471
Comments
Since the attention computation is still in FP16, could you benchmark with the original BF16 data type and see if there's still a gap? This could help locate the problem more precisely.
Maybe useful info: flashinfer-ai/flashinfer#521
Thanks @jeejeelee, but that issue relates to prefill performance. A quick look with the torch profiler indicates that the majority of time is spent in the decode kernel for both backends (profiler traces for FLASH_ATTN and FLASHINFER omitted). So it really seems like the FlashInfer decode kernel is slower than the FA equivalent.
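For reference, a minimal sketch of the kind of profiling described above, assuming a single-GPU offline run; the model ID and prompt set are placeholders, and `enforce_eager=True` is an assumption added so CUDA graph capture does not hide per-kernel timings:

```python
# Hedged sketch: inspect which kernels dominate decode with the PyTorch profiler.
# Model name and prompts are placeholders, not the configuration from this issue.
import torch
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enforce_eager=True)
prompts = ["Hello, my name is"] * 64
params = SamplingParams(max_tokens=128, temperature=0.0)

with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU,
                torch.profiler.ProfilerActivity.CUDA],
) as prof:
    llm.generate(prompts, params)

# Sort by GPU time to see whether the decode attention kernel dominates.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```

Repeating the run with `VLLM_ATTENTION_BACKEND=FLASH_ATTN` and `VLLM_ATTENTION_BACKEND=FLASHINFER` makes the decode-kernel comparison between the two backends visible in the tables.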
@comaniac Sure, here are the bf16 results, as well as some other datapoints we have collected (results table omitted). It looks like the heuristic that determines when to enable the tensor cores isn't working well for this model: vllm/vllm/attention/backends/flashinfer.py, line 127 at commit 1ffc8a7.
Kudos to my colleague @cyang49 for discovering this!
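For context, here is a minimal sketch of the kind of GQA-ratio heuristic being discussed. The exact condition and threshold in flashinfer.py at that commit are assumptions, not quotes; `use_tensor_cores` refers to the flag accepted by FlashInfer's `BatchDecodeWithPagedKVCacheWrapper`.

```python
# Hedged sketch of a GQA-ratio heuristic for enabling FlashInfer's tensor-core
# decode path. The exact condition in vllm/vllm/attention/backends/flashinfer.py
# at commit 1ffc8a7 may differ; treat the threshold below as an assumption.
def should_use_tensor_cores(num_qo_heads: int, num_kv_heads: int) -> bool:
    # With grouped-query attention, several query heads share one KV head, so
    # decode attention for the group can be batched into a small matmul that
    # benefits from tensor cores.
    gqa_group_size = num_qo_heads // num_kv_heads
    return gqa_group_size >= 4


# llama3.1-8b has 32 query heads and 8 KV heads, i.e. a group size of exactly 4.
# A strict "> 4" check would leave tensor cores disabled for this model, which
# would be consistent with the behaviour reported above.
print(should_use_tensor_cores(32, 8))  # True with ">= 4", False with "> 4"
# The resulting flag is what gets passed as `use_tensor_cores` when constructing
# flashinfer.BatchDecodeWithPagedKVCacheWrapper.
```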
This issue seems relevant: flashinfer-ai/flashinfer#520. It sounds like setting use_tensor_cores=True could help here.
@tdoublep @jeejeelee @cyang49 Thank you all for the investigation, and yes, I do think the original heuristic doesn't work for fp8.
Closed via #9497
Misc discussion on performance
TLDR: We are observing that FP8 throughput is significantly lower when using the FLASHINFER backend vs. using the default backend (FLASH_ATTN) for llama3.1-8b on a single H100 using v0.6.4.dev22+g5b8a1fde.

Here is a simple repro script (the original script is not preserved here; a hedged sketch follows at the end of this report). Running with the FLASH_ATTN backend: (throughput output omitted), whereas running with the FLASHINFER backend: (throughput output omitted).

From reading the FlashInfer blog, I don't think these results are expected. It is a shame because we would really like to use FlashInfer to pick up the FP8 KV cache feature.
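Since the original repro script is not preserved above, here is a minimal sketch of the kind of comparison described, assuming throughput is measured with the offline `LLM` API and the backend is selected via the `VLLM_ATTENTION_BACKEND` environment variable. The model ID, prompt set, and request shapes are placeholders rather than the original configuration.

```python
# Hedged sketch of a backend throughput comparison; run once per backend.
# The attention backend is chosen via VLLM_ATTENTION_BACKEND, set before
# vLLM constructs the model runner.
import os
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"  # or "FLASH_ATTN"

import time
from vllm import LLM, SamplingParams

llm = LLM(model="your-org/llama-3.1-8b-fp8")  # placeholder FP8 checkpoint
prompts = ["Summarize the history of GPUs."] * 512
params = SamplingParams(max_tokens=256, temperature=0.0)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"generated {generated} tokens in {elapsed:.1f}s "
      f"({generated / elapsed:.1f} tok/s)")
```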
Your current environment (if you think it is necessary)
Before submitting a new issue...