
[Question] Flash attention only applies to prefilling stage #147

Open
KexinFeng opened this issue Sep 24, 2023 · 2 comments
Labels
bug Something isn't working

Comments

KexinFeng commented Sep 24, 2023

I have a question arising from reading the code. I notice that in ~/lightllm/models/llama2/layer_infer/transformer_layer_infer.py, flash attention is only applied to the prefilling stage, i.e. context_attention_fwd, but not to the decoding stage, i.e. token_att_fwd. Am I correct in this understanding?

In principle, token attention doesn't conflict with flash attention. Do you plan to combine the two in the decoding stage?

Also, what is the obstacle to directly using the flash-attention repo together with the token-level memory management?
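
For concreteness, here is a minimal, hypothetical sketch of the two paths being discussed. The function names, tensor layouts, and index table below are illustrative assumptions, not the actual lightllm kernel signatures: prefill runs full causal attention over the whole prompt, which is where a fused flash-attention kernel in the spirit of context_attention_fwd applies, while each decode step attends with a single query over K/V gathered from a flat token-level pool through an index table, which is the role played by token_att_fwd.

```python
# Hypothetical, simplified sketch of the two attention paths discussed above
# (not the actual lightllm kernels; names and shapes are illustrative only).
import math
import torch

def prefill_attention(q, k, v):
    """Prefill: full causal attention over all prompt tokens at once.
    q, k, v: [seq_len, n_heads, head_dim]. This is the step where a fused
    flash-attention kernel is used in practice."""
    seq_len, n_heads, d = q.shape
    scores = torch.einsum("qhd,khd->hqk", q, k) / math.sqrt(d)
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool,
                                 device=scores.device), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    probs = scores.softmax(dim=-1)
    return torch.einsum("hqk,khd->qhd", probs, v)

def decode_token_attention(q, kv_pool_k, kv_pool_v, token_index):
    """Decode: one new query token attends to the K/V of its sequence,
    gathered from a flat token-level pool via an index table
    (the role of a token-attention kernel)."""
    # q: [n_heads, head_dim]; kv_pool_*: [pool_size, n_heads, head_dim]
    # token_index: [cur_seq_len] pool slots owned by this sequence, in order.
    k = kv_pool_k[token_index]          # [cur_seq_len, n_heads, head_dim]
    v = kv_pool_v[token_index]
    d = q.shape[-1]
    scores = torch.einsum("hd,khd->hk", q, k) / math.sqrt(d)
    probs = scores.softmax(dim=-1)
    return torch.einsum("hk,khd->hd", probs, v)
```

The decode path is the one that touches the token-level memory management, which is why the question of fusing it with a flash-style kernel comes up.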

KexinFeng added the bug label on Sep 24, 2023
hiworldwzj (Collaborator) commented
@KexinFeng We have tried to implement a Triton kernel that uses flash attention for this, but it is currently not fast enough.

KexinFeng commented Sep 26, 2023

@hiworldwzj I'm actually exploring the same thing. It looks to me that, in principle, flash attention is completely orthogonal to token-wise memory management (flash attention is essentially a streaming way of computing attention, which on paper is naturally compatible with token-wise memory management), so its acceleration should add directly on top of it. It is a little surprising that "it is not fast enough".

vllm is actually working on exactly the same thing:
vllm-project/vllm#877 (comment)

I'm wondering if you have any work-in-progress implementation to share, so that the community can contribute?
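
To make the orthogonality point concrete, here is a minimal sketch of a flash-style (online-softmax) decode step over the same hypothetical token-indexed KV pool as in the earlier sketch; the layout and names are assumptions, not lightllm's or vLLM's actual code. The streaming accumulation only ever needs one block of K/V at a time, so it does not care how the slots are laid out in the pool, only where the index table says they are.

```python
# Minimal sketch of flash-style (online-softmax) decode attention over a
# token-indexed KV pool, illustrating the "orthogonal in principle" point.
# Names and layout are assumptions, not lightllm's or vLLM's actual code.
import math
import torch

def flash_decode_attention(q, kv_pool_k, kv_pool_v, token_index, block_size=64):
    """Stream over the sequence's KV slots block by block, keeping only the
    running max, the softmax normalizer, and a weighted-value accumulator,
    exactly as in flash attention."""
    # q: [n_heads, head_dim]; kv_pool_*: [pool_size, n_heads, head_dim]
    n_heads, d = q.shape
    scale = 1.0 / math.sqrt(d)
    m = torch.full((n_heads,), float("-inf"), device=q.device)  # running max
    l = torch.zeros(n_heads, device=q.device)                   # running denominator
    acc = torch.zeros(n_heads, d, device=q.device)              # running sum of p @ V

    for start in range(0, token_index.numel(), block_size):
        idx = token_index[start:start + block_size]
        k = kv_pool_k[idx]                            # [blk, n_heads, d]
        v = kv_pool_v[idx]
        s = torch.einsum("hd,khd->hk", q, k) * scale  # [n_heads, blk]

        m_new = torch.maximum(m, s.max(dim=-1).values)
        alpha = torch.exp(m - m_new)                  # rescale old accumulators
        p = torch.exp(s - m_new[:, None])             # unnormalized block probabilities

        l = l * alpha + p.sum(dim=-1)
        acc = acc * alpha[:, None] + torch.einsum("hk,khd->hd", p, v)
        m = m_new

    return acc / l[:, None]
```

Up to floating-point error this produces the same output as the naive decode attention sketched earlier; making it actually faster than the unfused token-attention kernel is the engineering question raised in this thread.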
