[Encoder Decoder] Add flash_attn kernel support for encoder-decoder models #9559
Conversation
Pull from head
Thanks for the review. Addressed comments. PTAL
This pull request has merge conflicts that must be resolved before it can be merged.
LGTM. Thanks for your hard work on this. Looking forward to the follow-up PRs for test_encoder_decoder_attention and mllama support.
Also CC @WoosukKwon. You may need to sync this PR to v1 later.
@ywang96 PTAL when you get a chance. PR has been LG'ed by @heheda12345, is synced to head, and all tests are passing.
This pull request has merge conflicts that must be resolved before it can be merged.
Thanks for this great work!
This PR adds flash attention kernel support for encoder-decoder models. For encoder-decoder models with dtype=bfloat16, the default backend choice is now FlashAttention instead of XFormers. However, for llama-3.2-11b-vision-instruct we still use the XFormers backend even with dtype=bfloat16, because the model implementation (models/mllama.py) depends on PagedAttention.
To add this support, we make the following changes in this PR:
#7366
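As a rough illustration of the backend selection described above (not part of this PR's diff), the sketch below runs an encoder-decoder model with dtype=bfloat16, which should now pick the FlashAttention backend by default, and shows how XFormers can still be forced via the VLLM_ATTENTION_BACKEND environment variable. The model name facebook/bart-large-cnn and the prompt are placeholders; substitute any supported encoder-decoder model.

```python
# Minimal sketch, assuming vLLM's offline LLM API and the
# VLLM_ATTENTION_BACKEND override; model/prompt are illustrative only.
import os

# Optional: uncomment to force the previous behavior and use XFormers
# even when FlashAttention is available for this dtype.
# os.environ["VLLM_ATTENTION_BACKEND"] = "XFORMERS"

from vllm import LLM, SamplingParams

# With bfloat16, encoder-decoder models should now default to FlashAttention.
llm = LLM(model="facebook/bart-large-cnn", dtype="bfloat16")

outputs = llm.generate(
    ["The quick brown fox jumps over the lazy dog. " * 8],
    SamplingParams(temperature=0.0, max_tokens=64),
)
for out in outputs:
    print(out.outputs[0].text)
```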