use flash-attn via xformers #877
Conversation
Hi @tmm1, thanks for letting us know about the performance issue and submitting the PR. While using FA2 might improve performance, we have concerns about using it because it does not support attention bias (e.g., ALiBi), V100 GPUs, the FP32 data type, or head_size 256 (which is used by GPT-J). So, to use FA2, I believe we should add a fallback option to the xformers cutlass backend.
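For concreteness, a minimal sketch of that fallback idea, assuming the `MemoryEfficientAttentionFlashAttentionOp` and `MemoryEfficientAttentionCutlassOp` aliases exposed by `xformers.ops`; the eligibility checks are illustrative assumptions, not the code in this PR:

```python
import torch
import xformers.ops as xops

def attention_with_fallback(q, k, v, attn_bias=None):
    """Prefer the flash-attn backend; fall back to the cutlass backend when
    the inputs hit a flash-attn limitation (attention bias such as ALiBi,
    FP32 tensors, large head sizes like GPT-J's 256, or pre-Ampere GPUs
    such as V100). Shapes: [batch, seq_len, num_heads, head_dim]."""
    major, _ = torch.cuda.get_device_capability(q.device)
    head_size = q.shape[-1]
    flash_ok = (
        attn_bias is None                               # no bias support (e.g. ALiBi)
        and q.dtype in (torch.float16, torch.bfloat16)  # no FP32
        and head_size <= 128                            # conservative cap; excludes head_size 256
        and major >= 8                                  # Ampere or newer; excludes V100 (sm70)
    )
    op = (xops.MemoryEfficientAttentionFlashAttentionOp if flash_ok
          else xops.MemoryEfficientAttentionCutlassOp)
    return xops.memory_efficient_attention(q, k, v, attn_bias=attn_bias, op=op)
```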
Thanks @WoosukKwon, I updated the PR to allow xformers to fall back. cc @danthe3rd
Hi @tmm1, I am very interested in your PR. I see the PR does not allow xformers to fall back.
Hi, xformers maintainer here
Hi @danthe3rd, thanks for your explanation. I just wonder why
At the moment,
Thanks a lot. So it makes sense that most LLM inference frameworks have hand-written CUDA kernels for their fused attention implementations.
LGTM! Thank you for your contribution!
Hi @tmm1 @WoosukKwon
Is flash attention (or a similar algorithm where the softmax is computed in a streaming fashion with a fused kernel) also implemented inside
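For readers who have not seen the streaming (online) softmax the question refers to, here is a minimal PyTorch sketch of the idea; it only illustrates the algorithm and is not the kernel code in this project:

```python
import torch

def streaming_attention(q, k, v, block_size=64):
    """Online-softmax attention for one query: process keys/values in blocks,
    keeping running (max, denominator, weighted sum) statistics so the full
    attention matrix is never materialized. Shapes: q [d], k [n, d], v [n, d]."""
    scale = q.shape[-1] ** -0.5
    m = q.new_tensor(float("-inf"))  # running max of the logits
    l = q.new_tensor(0.0)            # running softmax denominator
    acc = torch.zeros_like(v[0])     # running weighted sum of values
    for start in range(0, k.shape[0], block_size):
        kb, vb = k[start:start + block_size], v[start:start + block_size]
        s = (kb @ q) * scale                  # logits for this block
        m_new = torch.maximum(m, s.max())
        correction = torch.exp(m - m_new)     # rescale previously accumulated stats
        p = torch.exp(s - m_new)              # unnormalized block probabilities
        acc = acc * correction + p @ vb
        l = l * correction + p.sum()
        m = m_new
    return acc / l
```

The result matches `torch.softmax((k @ q) * scale, dim=0) @ v`, but only one block of logits is live at a time, which is the property flash-attention-style kernels exploit.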
Regarding "allow xformers to pick the best available implementation": I don't quite understand this change, so how should I use flash-attn?
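A rough usage sketch (shapes and dtype here are illustrative): with the default `op=None`, `xformers.ops.memory_efficient_attention` dispatches to the fastest backend that supports the inputs, using the flash-attn backend when it is available and the inputs qualify, and the cutlass backend otherwise.

```python
import torch
import xformers.ops as xops

# Layout expected by xformers: [batch, seq_len, num_heads, head_dim],
# half precision on a CUDA device so the flash-attn backend is eligible.
q = torch.randn(1, 1024, 16, 128, dtype=torch.float16, device="cuda")
k = torch.randn(1, 1024, 16, 128, dtype=torch.float16, device="cuda")
v = torch.randn(1, 1024, 16, 128, dtype=torch.float16, device="cuda")

# op=None (the default) lets xformers pick the best available implementation.
out = xops.memory_efficient_attention(q, k, v, op=None)
```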
fixes #485 (comment)