Introduce flash-attn (>= 2.5.0). #3010
Polite ping @WoosukKwon: I just noticed you have the working branch. Thanks!
The entrypoint test failure should not be caused by this PR.
Glad to see FA being ported to vLLM. FYI, there is already a similar PR, #2744. As FlashInfer is faster than FlashAttention, vLLMers may prefer FlashInfer over FA.
Thanks for your work! The paged KV cache block size in Flash Attention must be divisible by 256, which is a big difference from block_size=16 in vLLM. Could you share the latency and throughput benchmark data? Does it cause any side effects?
Hi @zhaoyang-star, I have updated the description with numbers (using the same settings as #2744, thank you!). Note that the current throughput benchmark may not reflect real-world situations (as EOS is ignored and batches are aligned). More benchmark data on unaligned, continuous batching will be added soon.
@sighingnow The throughput speedup is ~3% according to your benchmark. How about the e2e latency?
Still running. I have added new data for speculative decoding, which involves unaligned input sequences inside a batch and requires parallel decoding for the main model.
@sighingnow I tested the e2e latency using Flash-Attention in vLLM. A performance boost can only be seen with large batch sizes.
For input=512 and output=512, the improvement is smaller than the numbers reported in #2744. This is probably because I didn't use tensor parallelism to accelerate the MLPs (which are more costly for shorter context sizes, and GPTQ may amplify this).
Hello, just wanted to provide a quick update on Dao-AILab/flash-attention#824: the PR is complete, and flash attention can now support page sizes as low as 16.
Thank you!
@zhaoyang-star I have uploaded more data and analysis to the pull request description. I hope that is helpful.
Hi @zhaoyang-star, I have observed the side effect of a larger block size on saving key/value vectors into the kv-cache (the side effect only affects prefill, and exists for both the xformers kernel and flash-attn's kernel). I have included the data in the pull request description. After @skrider's PR has been merged, I will try again with the new version of flash-attention.
Hi @WoosukKwon @zhuohan123, I would like to know if you have any comments on this pull request. Integrating the changes in this PR into #3005 is also fine with me.
Signed-off-by: Tao He <[email protected]>
Closing as it is superseded by #3005.
flash-attention has supported paged kv-cache since v2.5.0 (commit Dao-AILab/flash-attention@54e80a3) in the `flash_attn_with_kvcache` interface.
This PR enables the use of flash-attn kernels for prefilling, contexted prefill (the current `context_attention_fwd` kernel for prefix cache), and decoding, for both MHA and MQA/GQA.
(This PR includes the GQA fixes for prefix cache from #3007, which should be accepted/merged first.)
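For readers unfamiliar with paged kv-cache, the following is a minimal sketch of how a per-sequence block table maps a token position to a physical cache block, and of how the block size changes the number of blocks and unused slots. The helper names (`blocks_needed`, `locate_token`, `wasted_slots`) are hypothetical and exist only for illustration; this is not the vLLM or flash-attn implementation.

```python
from typing import List, Tuple


def blocks_needed(seq_len: int, block_size: int) -> int:
    """Number of fixed-size cache blocks needed to hold seq_len tokens."""
    return (seq_len + block_size - 1) // block_size


def wasted_slots(seq_len: int, block_size: int) -> int:
    """Unused token slots in the last, partially filled block."""
    return blocks_needed(seq_len, block_size) * block_size - seq_len


def locate_token(block_table: List[int], pos: int, block_size: int) -> Tuple[int, int]:
    """Map a logical token position to (physical block id, offset in block).

    block_table[i] holds the physical block id of the i-th logical block
    of one sequence (an illustrative layout, assumed here).
    """
    logical_block, offset = divmod(pos, block_size)
    return block_table[logical_block], offset


if __name__ == "__main__":
    seq_len = 300
    for block_size in (16, 256):
        print(
            f"block_size={block_size}: "
            f"blocks={blocks_needed(seq_len, block_size)}, "
            f"wasted_slots={wasted_slots(seq_len, block_size)}"
        )
    # Token 40 of a sequence whose logical blocks live in physical blocks 7, 2, 9.
    print(locate_token([7, 2, 9], pos=40, block_size=16))
```

With a 300-token sequence, a block size of 16 leaves only a few slots unused in the last block, while a block size of 256 leaves most of the second block empty. This is the kind of difference the block-size discussion in the conversation is about.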
Takeaways
(Correct me if I got something wrong in the following evaluation.)
`memory_efficient_attention_forward`.
Benchmarks
Kernel performance
`benchmarks/kernels/benchmark_paged_attention.py` (on A800-SXM4-80GB):
`benchmarks/kernels/benchmark_attention.py` (on A800-SXM4-80GB):
The `no kv-cache` version has `cache_ops.reshape_and_cache` disabled in `attention.py`. `FA` is the version that uses `flash_attn_func` for prefill and tensor indexing operations to update the kv-cache; `FA-kvcache` uses the `flash_attn_with_kvcache` kernel itself to update the kv-cache.
Throughput
`benchmark_throughput.py` on `Llama-70B-GPTQ` (on A800-SXM4-80GB):
Here, "Speed (seconds/round)" is the average duration per 10 prompt/decoding runs. We can see that
Real-world cases throughput
`flash_attn_with_kvcache` and `context_attention_fwd`):