
Integrate FlashAttention into HF OPT #18439

Closed
erichan1 wants to merge 2 commits

Conversation


@erichan1 commented Aug 2, 2022

Integrate FlashAttention.

  • Requires the FlashAttention integration in pytorch/pytorch#81434 to work; torch._scaled_dot_product_attention is only available there.
  • Enable the fast path or fall back to the slow path with the fast_attention=True/False flag.
  • Enable or disable the causal mask on the fast attention path with fast_attention_causal=True/False.
  • Does not support an attention mask or padding mask on the fast path.
  • Currently requires an unnecessary conversion to NestedTensor and back, because the current FlashAttention implementation only accepts NestedTensor (a rough sketch of this round trip follows the list). This will be removed once torch._scaled_dot_product_attention supports regular tensors.
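For illustration, a rough sketch of that NestedTensor round trip, written against the public torch.nested and torch.nn.functional.scaled_dot_product_attention APIs that shipped later (the PR itself calls the private torch._scaled_dot_product_attention; the helper name, shapes, and flags here are illustrative assumptions, not code from this PR):

import torch
import torch.nn.functional as F

def fast_path_attention(query_states, key_states, value_states, causal=True):
    # query/key/value_states: padded (batch, num_heads, seq_len, head_dim) tensors.
    # Pack each batch element into a NestedTensor, since the FlashAttention path
    # described above only accepts NestedTensor inputs.
    q_nt = torch.nested.nested_tensor(list(torch.unbind(query_states, dim=0)))
    k_nt = torch.nested.nested_tensor(list(torch.unbind(key_states, dim=0)))
    v_nt = torch.nested.nested_tensor(list(torch.unbind(value_states, dim=0)))

    # Scaled dot product attention; PyTorch dispatches to FlashAttention when the
    # inputs and hardware allow it.
    attn_nt = F.scaled_dot_product_attention(q_nt, k_nt, v_nt, is_causal=causal)

    # Convert back to a padded, regular tensor so the rest of the layer is unchanged.
    return torch.nested.to_padded_tensor(attn_nt, 0.0)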

@erichan1 changed the title from Erichan1/flashatt opt to Integrate FlashAttention into HF OPT on Aug 2, 2022
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

Comment on lines +200 to +202
query_states_fast = torch.nested_tensor(torch.unbind(query_states, dim=0))
key_states_fast = torch.nested_tensor(torch.unbind(key_states, dim=0))
value_states_fast = torch.nested_tensor(torch.unbind(value_states, dim=0))


won't this result in padding within the nested tensors?

i.e. if query_states is a padded rectangular tensor, calling unbind on it will produce sequences padded to the same length, so we won't be taking advantage of nested tensors to reduce padding. Or am I missing something?

@erichan1 (Author)


I'm assuming only zero-padded tensor inputs for now. This is a hack just to turn those tensors into NestedTensors, because the current FlashAttention SDP implementation requires NestedTensor. If FlashAttention SDP supported regular tensors, I would remove these lines entirely.


gotcha, thanks for the clarification :)

@github-actions

github-actions bot commented Sep 2, 2022

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot closed this Sep 10, 2022
@fzyzcjy
Contributor

fzyzcjy commented Dec 30, 2022

Hi, are there any updates? Coming from https://github.com/HazyResearch/flash-attention/blob/main/usage.md

@puyuanOT

Looking forward to the update!

@erichan1
Author

> Looking forward to the update!

Hey there @puyuanOT! Not working on this actively anymore. Check out torch SDP to use FlashAttn in native torch!

@puyuanOT

Thanks @erichan1 ! I will check it out.

@vincentmin

vincentmin commented Apr 24, 2023

@erichan1 Could you explain why you stopped working on this feature? I think it would be a great addition to the transformers library.
Regarding the torch SDP link, could you give instructions on how to use this torch feature with a model in Hugging Face transformers?

Edit: Is it the case that flash attention is now activated by default in recent versions of torch? If so, I would recommend a Hugging Face blog article to advertise this feature and explain how it works. Documentation on flash-attention support is currently rather lacking.

@amyeroberts
Collaborator

Within the Hugging Face ecosystem, it's possible to use BetterTransformer and the optimum library to improve model performance: [1], [2]. @younesbelkada Is flash attention available yet through this?

@erichan1
Author

@amyeroberts @vincentmin I'm from the PyTorch team. We decided that the best way to provide FlashAttention was to create a new module covering just the component FlashAttention implements: scaled dot product attention (SDP). This is the part that computes softmax(Q @ K^T) @ V and doesn't include the input and output projections. Since we built this abstraction, we also decided to use it to offer some other implementations of SDP, including a memory-efficient one we built in-house that uses less memory than FlashAttention but is slower.

You can use SDP directly by replacing the relevant chunk of code in your transformer definition. But I'm not sure of a way to enable it with a flag you flip in Hugging Face; I'll let @younesbelkada speak to that. I believe BetterTransformer and SDP (which is part of BetterTransformer) support is already part of Optimum.
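For concreteness, a minimal sketch of that kind of replacement using the public API that eventually shipped in PyTorch 2.0, torch.nn.functional.scaled_dot_product_attention (the shapes, dtype, and device below are illustrative assumptions, not part of this thread):

import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# Toy projected tensors with shape (batch, num_heads, seq_len, head_dim).
q = torch.randn(2, 12, 128, 64, device=device, dtype=dtype)
k = torch.randn(2, 12, 128, 64, device=device, dtype=dtype)
v = torch.randn(2, 12, 128, 64, device=device, dtype=dtype)

# Replaces a hand-written softmax(Q @ K^T / sqrt(d)) @ V block; PyTorch picks a
# backend (FlashAttention, memory-efficient, or plain math) based on the inputs
# and hardware.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)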

@vincentmin

@erichan1 @amyeroberts Thank you for the clarifications. I now understand that BetterTransformer should offer the features I am looking for. I encourage you to write a blog post on Hugging Face to advertise this to the world!

@younesbelkada
Contributor

Hi @erichan1 @amyeroberts @vincentmin
This is correct, SDPA is now part of optimum's BetterTransformer API; however, it is only available for decoder-based models right now.
We are indeed planning to write a blog post with PyTorch soon to publicly announce the feature. We will keep you posted here!

@pseudotensor mentioned this pull request Apr 28, 2023
@KatarinaYuan

Hi, are there any recent updates on the BetterTransformer blog post you mentioned earlier?

@younesbelkada
Contributor

Hi @KatarinaYuan
Yes, the blog post is out: https://pytorch.org/blog/out-of-the-box-acceleration/

@KatarinaYuan

KatarinaYuan commented Jun 14, 2023 via email

@ASR-SCI

ASR-SCI commented Jun 29, 2023

I use the transformers Trainer with FSDP for LLaMA training; the model cannot be saved, and I am unable to use BetterTransformer.reverse() to convert back to the original model. I don't know how to deal with this problem.

@EwoutH

EwoutH commented Jul 18, 2023

Are there any updates on the integration of FlashAttention into HuggingFace Transformers?

@younesbelkada
Contributor

younesbelkada commented Jul 18, 2023

@EwoutH
FlashAttention is used as a backend for torch SDPA, which is itself integrated into the BetterTransformer API. Make sure to install the latest transformers and optimum libraries and run:

model = model.to_bettertransformer()

Check the blogpost: https://pytorch.org/blog/out-of-the-box-acceleration/ for reference

cc @fxmarty as well
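For readers landing here, a minimal end-to-end sketch of the flow described above (the model name facebook/opt-350m, the prompt, and the generation settings are illustrative assumptions; recent transformers and optimum releases are assumed):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")

# Swap in the BetterTransformer path, which routes attention through torch SDPA
# (and FlashAttention when the inputs and hardware are eligible).
model = model.to_bettertransformer()

inputs = tokenizer("Hello, my name is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))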

@tmm1
Contributor

tmm1 commented Aug 1, 2023

is BetterTransformer up to date with FlashAttention v2?

@fxmarty
Contributor

fxmarty commented Aug 1, 2023

Hi, BetterTransformer integrates with PyTorch SDPA (for now), and PyTorch has not integrated flash v2 yet: pytorch/pytorch#105602. Hopefully it will be there in PyTorch 2.1.
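In the meantime, a small sketch for pinning down which SDPA backend actually runs, using the PyTorch 2.0-era torch.backends.cuda.sdp_kernel context manager (a CUDA device and FlashAttention-compatible inputs are assumed; this context manager was later superseded by torch.nn.attention.sdpa_kernel):

import torch
import torch.nn.functional as F

# Half-precision CUDA tensors shaped (batch, num_heads, seq_len, head_dim).
q = torch.randn(1, 8, 256, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 8, 256, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 8, 256, 64, device="cuda", dtype=torch.float16)

# Restrict SDPA to the FlashAttention backend only; if FlashAttention cannot
# handle these inputs, this raises instead of silently falling back to another backend.
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)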
