use flash-attn via xformers #877

Merged
merged 2 commits into vllm-project:main on Aug 30, 2023
Conversation

tmm1
Contributor

@tmm1 tmm1 commented Aug 25, 2023

@WoosukKwon WoosukKwon self-requested a review August 26, 2023 01:14
@WoosukKwon
Collaborator

WoosukKwon commented Aug 26, 2023

Hi @tmm1, thanks for letting us know about the performance issue and for submitting the PR.

While using FA2 might improve performance, we have concerns about using it because it does not support attention biases like ALiBi, V100 GPUs, the FP32 data type, or head_size 256 (which is used by GPT-J). So, to use FA2, I believe we should add a fallback to the xformers cutlass backend.
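
For context, a minimal sketch of the kind of fallback being suggested here (not the PR's actual code): pick the xformers FlashAttention op only when the inputs are supported, and use the cutlass op otherwise. The `attention_forward` helper and the exact support checks are illustrative assumptions.

```python
import torch
import xformers.ops as xops
from xformers.ops.fmha import cutlass, flash

def attention_forward(q, k, v, attn_bias=None):
    """Illustrative fallback: FA2 via xformers when supported, cutlass otherwise."""
    head_size = q.shape[-1]
    use_flash = (
        q.dtype in (torch.float16, torch.bfloat16)        # FA2 has no FP32 kernels
        and torch.cuda.get_device_capability() >= (8, 0)  # excludes V100 (sm70)
        and attn_bias is None                             # no arbitrary bias such as ALiBi
        and head_size <= 128                              # e.g. GPT-J's head_size 256 unsupported
    )
    op = flash.FwOp if use_flash else cutlass.FwOp
    return xops.memory_efficient_attention_forward(q, k, v, attn_bias=attn_bias, op=op)
```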

@tmm1
Contributor Author

tmm1 commented Aug 26, 2023

> So, to use FA2, I believe we should add a fallback to the xformers cutlass backend.

Thanks @WoosukKwon, I updated the PR to let xformers fall back.

cc @danthe3rd

@zhaoyang-star
Contributor

> > So, to use FA2, I believe we should add a fallback to the xformers cutlass backend.
>
> Thanks @WoosukKwon, I updated the PR to let xformers fall back.
>
> cc @danthe3rd

Hi @tmm1, I am very interested in your PR, but I don't see where it allows xformers to fall back.

@danthe3rd

Hi, xformers maintainer here.
The way it's set up in this PR lets xformers decide which backend to use, which is also what we recommend. By default it will use Flash v2 if available, and fall back to cutlass if it isn't (e.g. when using a custom bias, FP32, or a V100).
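
Concretely, the pattern described above looks roughly like the following; the shapes, dtypes, and bias here are illustrative assumptions, not taken from the PR.

```python
import torch
import xformers.ops as xops

# [batch, seq_len, n_heads, head_size], half precision on an Ampere GPU
q = torch.randn(1, 1024, 16, 128, dtype=torch.float16, device="cuda")
k, v = torch.randn_like(q), torch.randn_like(q)

# With op=None (the default), the xformers dispatcher picks FlashAttention v2
# when the inputs qualify, and falls back to the cutlass kernel otherwise
# (e.g. FP32 inputs, a V100, or a custom attention bias).
out = xops.memory_efficient_attention_forward(
    q, k, v, attn_bias=xops.LowerTriangularMask()
)
```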

@zhaoyang-star
Contributor

Hi @danthe3rd, thanks for your explanation. I just wonder why xformers.ops.memory_efficient_attention_forward is not used in the decoding stage? The API is already used in the prefill stage in vLLM. A hand-written kernel may have lower performance than xformers if we are not CUDA experts.

@danthe3rd

danthe3rd commented Aug 29, 2023

At the moment, memory_efficient_attention is mostly optimized for training, which is similar to the prefill stage in terms of problem sizes. We do have a backend for next-token decoding, but it's not yet fully optimized for all cases; we're working on it :)
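
To make the problem-size point concrete, a rough illustration of the shapes involved (all numbers are hypothetical):

```python
import torch

B, H, D, prompt_len, ctx_len = 8, 32, 128, 512, 2048

# Prefill looks like training: many query tokens per sequence, so the large
# tiled attention kernels are well utilized.
q_prefill = torch.empty(B, prompt_len, H, D)   # [8, 512, 32, 128]
kv_prefill = torch.empty(B, prompt_len, H, D)

# Decode: a single new query token per sequence attending over a long KV cache,
# a very different (and memory-bound) workload.
q_decode = torch.empty(B, 1, H, D)             # [8, 1, 32, 128]
kv_cache = torch.empty(B, ctx_len, H, D)       # [8, 2048, 32, 128]
```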

@zhaoyang-star
Contributor

zhaoyang-star commented Aug 29, 2023

> At the moment, memory_efficient_attention is mostly optimized for training, which is similar to the prefill stage in terms of problem sizes. We do have a backend for next-token decoding, but it's not yet fully optimized for all cases; we're working on it :)

Thanks a lot. So it makes sense that most LLM inference frameworks have hand-written CUDA kernels for their fused attention implementations.

Member

@zhuohan123 zhuohan123 left a comment

LGTM! Thank you for your contribution!

@zhuohan123 zhuohan123 merged commit 7547138 into vllm-project:main Aug 30, 2023
2 checks passed
liuyanyi pushed a commit to liuyanyi/vllm that referenced this pull request Sep 12, 2023
@KexinFeng

KexinFeng commented Sep 14, 2023

Hi @tmm1 @WoosukKwon
I have a follow-up question to this earlier question:

> I just wonder why xformers.ops.memory_efficient_attention_forward is not used in the decoding stage?

Is flash attention (or a similar algorithm where the softmax is computed in a streaming fashion with a fused kernel) also implemented inside vllm.attention_ops.single_query_cached_kv_attention? It looks to me that, in principle, the flash attention algorithm is very compatible with paged attention: the softmax could be computed in a streaming fashion with a fused kernel over paged tensor storage, too.
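
For reference, the streaming (online) softmax that makes this composition possible can be sketched in a few lines. This is plain PyTorch for illustration only, not vLLM's single_query_cached_kv_attention kernel; `kv_blocks` is a hypothetical list of cached K/V blocks.

```python
import torch

def single_query_attention_over_blocks(q, kv_blocks, scale):
    # q: [D]; kv_blocks: list of (K_block [block_size, D], V_block [block_size, D])
    m = torch.tensor(float("-inf"))             # running max of the logits
    l = torch.tensor(0.0)                       # running softmax denominator
    acc = torch.zeros_like(kv_blocks[0][1][0])  # running weighted sum of V
    for k_blk, v_blk in kv_blocks:
        s = (k_blk @ q) * scale                 # logits for this block
        m_new = torch.maximum(m, s.max())
        correction = torch.exp(m - m_new)       # rescale previous partial sums
        p = torch.exp(s - m_new)
        l = l * correction + p.sum()
        acc = acc * correction + p @ v_blk
        m = m_new
    return acc / l
```

Each block's contribution is folded into the running max, denominator, and weighted V sum, so the result is exact no matter how the K/V cache is split into pages.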

@Lvjinhong

Regarding "allow xformers to pick the best available implementation": I don't quite understand this change, so how should I use flash attention?

hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024
sjchoi1 pushed a commit to casys-kaist-internal/vllm that referenced this pull request May 7, 2024
Successfully merging this pull request may close these issues.

Flash Attention V2
7 participants