[v2] Attention Masking #352

MikeynJerry · 2023-07-20T01:48:38Z

Is there any plan to add attention masking support? PyTorch's version of flash attention v1 accepted an attention mask in its implementation, and it would be very useful to have this feature in v2.

leizhao1234 · 2023-07-20T02:26:18Z

In fact, when you pass an attention mask to PyTorch's implementation, flash attention doesn't work.

balachandarsv · 2023-07-21T07:00:50Z

Yes, facing the same issue. @tridao Can you please take a look at this and respond when you are available?

tridao · 2023-07-21T07:11:27Z

Attention mask isn't supported (either in v1 or v2). I might implement it at some point but there are other priorities now.

PeterL1n · 2023-08-03T21:53:47Z

I thought masking was supported through flash_attn_varlen_func:

https://github.com/Dao-AILab/flash-attention/blob/d30f2e1cd50185c98ed88c0684b4a603f15bee37/flash_attn/flash_attn_interface.py#L454C21-L454C21
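
For context, `flash_attn_varlen_func` handles padding by packing sequences rather than by taking a mask. It expects the non-padded tokens of all sequences concatenated along one dimension, plus cumulative sequence lengths (`cu_seqlens_q`, `cu_seqlens_k`) marking where each sequence starts and ends. Below is a minimal sketch of deriving those offsets from a 0/1 padding mask. It uses plain Python lists so that it runs without a GPU, and the helper name `cu_seqlens_from_padding_mask` is made up for illustration; in practice these values would be int32 tensors on the device.

```python
from itertools import accumulate

def cu_seqlens_from_padding_mask(padding_mask):
    """padding_mask: list of rows, 1 = real token, 0 = padding.

    Returns (cu_seqlens, max_seqlen), where cu_seqlens has batch + 1 entries
    and sequence i occupies packed positions cu_seqlens[i]:cu_seqlens[i + 1].
    """
    seqlens = [sum(row) for row in padding_mask]
    cu_seqlens = [0] + list(accumulate(seqlens))
    return cu_seqlens, max(seqlens)

mask = [
    [1, 1, 1, 0],  # length 3
    [1, 1, 0, 0],  # length 2
    [1, 1, 1, 1],  # length 4
]
cu_seqlens, max_seqlen = cu_seqlens_from_padding_mask(mask)
print(cu_seqlens, max_seqlen)  # [0, 3, 5, 9] 4
```

This covers padding and packed sequences, but it cannot express an arbitrary per-position mask.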

zhipeng93 · 2023-09-15T02:33:41Z

I have tested v1.0.7 and v2.0.4. It turns out that neither of them supports an attention mask. I compared two setups:

A: using flash attention with attention mask
B: not using flash attention, with attention mask

The results of A and B are different.

samvanstroud · 2023-09-25T21:05:41Z

This paper might be relevant: https://arxiv.org/abs/2306.01160.

There are several related issues:

I believe pytorch 2.1 will have a memory efficient attention implementation that supports arbitrary masks: pytorch/pytorch#96099

defei-coder · 2023-10-20T12:57:02Z

@tridao Hello, I plan to add a bias mask to FlashAttention-2. I noticed that, in order to fuse the scale and add operations in scale_apply_exp2, the scale is delayed until after the maximum value is calculated. I plan to support the bias mask in the apply_mask_causal function. If a bias mask is supported, it seems that the FFMA optimization in scale_apply_exp2 might have to be dropped. Applying the scale and bias together could still benefit from FFMA; do you have any suggestions?
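
To spell out the arithmetic behind this question, here is a sketch based only on the description above, not on the kernel source. The symbols are assumptions introduced here: $s_{ij}$ is the raw $QK^\top$ score, $c > 0$ the softmax scale, $b_{ij}$ the additive bias, and $m_i$, $\tilde{m}_i$ row maxima.

```latex
% No bias: the scale can be applied after the max, one multiply-add per element.
m_i = \max_j s_{ij}, \qquad
p_{ij} = 2^{\,(c \log_2 e)\, s_{ij} \;-\; (c \log_2 e)\, m_i}

% With a bias: the max must be taken over the scaled and biased scores.
t_{ij} = (c \log_2 e)\, s_{ij} + (\log_2 e)\, b_{ij}, \qquad
\tilde{m}_i = \max_j t_{ij}, \qquad
p_{ij} = 2^{\,t_{ij} - \tilde{m}_i}
```

Under this reading, forming $t_{ij}$ is still a single multiply-add if the bias is pre-multiplied by $\log_2 e$, but subtracting $\tilde{m}_i$ becomes a separate operation, because the scale can no longer be deferred past the max.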

zhangyipin · 2023-11-06T07:10:04Z

flash_attn/flash_attn_triton.py supports a bias input, so you can use bias=-inf for the masked positions.
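
The idea is that an additive bias of `-inf` on a score drives its softmax weight to exactly zero, so a boolean mask can be expressed as a bias. Here is a small pure-Python sketch of that equivalence for one row of scores. The helper names are illustrative and not part of `flash_attn`.

```python
import math

NEG_INF = float("-inf")

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mask_to_bias(keep):
    """True = attend (bias 0.0), False = masked out (bias -inf)."""
    return [0.0 if k else NEG_INF for k in keep]

scores = [0.5, 2.0, -1.0, 0.3]
keep = [True, False, True, True]

bias = mask_to_bias(keep)
biased = softmax([s + b for s, b in zip(scores, bias)])

# Reference: softmax over only the kept positions.
kept = softmax([s for s, k in zip(scores, keep) if k])

print(biased[1])  # 0.0, the masked position gets no weight
```

One caveat: a row that is fully masked has every score at `-inf`, and this softmax would then return NaN, so that case needs separate handling.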

wehos · 2024-02-29T18:26:42Z

> flash_attn/flash_attn_triton.py support bias input you can use bias=-inf

This is a good point, but the example itself does not work with PyTorch 2.0+ (which implies Triton 2.0+) 😭

jaanli · 2024-03-06T23:39:40Z

Anyone have tips on custom masks with flash attention for training?

(I need this to train encoder-decoder models with variable-length sequences using non-causal masks.)

This came up in a recent article: https://www.yitay.net/blog/training-great-llms-entirely-from-ground-zero-in-the-wilderness

> The other striking thing is how little support these codebases have for large scale encoder-decoder training or even prefixLM training. To that end, even flash attention has consistently declined to provide support for prefixLM training (i.e., custom masks) despite reasonable demand on their github issues for whatever reason.

Curious what this would take or if it is still out of scope for the flash attention library?

Really grateful that this exists!! Just posting for visibility in case others have solved this problem :)
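
For readers unfamiliar with the term, a prefixLM mask lets every position attend to the whole prefix bidirectionally, while positions after the prefix attend causally. Below is a minimal sketch of that mask as a dense boolean matrix. The function name and the `True` = may attend convention are choices made here for illustration.

```python
def prefix_lm_mask(seq_len, prefix_len):
    """mask[q][k] is True when query position q may attend to key position k."""
    return [
        [k < prefix_len or k <= q for k in range(seq_len)]
        for q in range(seq_len)
    ]

mask = prefix_lm_mask(seq_len=5, prefix_len=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# xx...
# xx...
# xxx..
# xxxx.
# xxxxx
```

A purely causal mask would give `x....` for the first row, which is why a causal flag alone does not cover this case.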

tridao · 2024-03-07T00:50:42Z

> Curious what this would take or if it is still out of scope for the flash attention library?

Not out of scope, it's just someone needs to go implement it :D

jaanli · 2024-03-07T03:54:59Z

Understood — thank you!! Will try using the varlen functions for now :)

ardagoreci · 2024-05-26T17:57:21Z

I was wondering if there have been any updates on this? AlphaFold3 uses a lot of attention pair biasing, and it would be tremendously useful to computational biology if flash attention supported attention biasing!

tridao · 2024-05-26T18:44:48Z

> I was wondering if there has been any updates on this? AlphaFold3 uses a lot of attention pair biasing and it would be tremendously useful to computational biology if flash attention supported attention biasing!

Right, we still need someone to implement it.

alexzhang13 · 2024-07-13T19:54:37Z

@tridao I was wondering what needs to be done for this to be implemented (efficiently, I'm assuming? Otherwise it seems quite simple).

I need a similar feature (arbitrary attention masks) but I figured I might take a stab at just implementing it if it still needs to be done.

alexzhang13 · 2024-07-21T07:09:31Z

I've implemented a version of custom masking for FA2 in Triton: https://github.com/alexzhang13/flashattention2-custom-mask

It suffices for my use case, but if something comes up where it's necessary to touch the FA3 code I may re-visit this.

amyxlu · 2024-08-21T22:03:59Z

> Attention mask isn't supported (either in v1 or v2). I might implement it at some point but there are other priorities now.

Seems like the FlashAttention class does take in a key_padding_mask argument in its forward method. What would be the difference between this and the attention mask to be implemented? Cc @tridao. Thanks!

tridao · 2024-08-21T22:54:06Z

As you can see in the code, key_padding_mask just removes elements from keys and values before passing to the flash attention kernel. There's no attention mask passed to the kernel.
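
In other words, the padding mask is consumed before the kernel call as an index selection, and the kernel only ever sees packed tokens. Here is a rough sketch of that unpad and re-pad round trip on plain Python lists. The names `unpad` and `repad` are hypothetical, chosen for this illustration; the library performs the equivalent steps on tensors.

```python
def unpad(batch, key_padding_mask):
    """Flatten a padded batch, keeping only positions where the mask is True.

    Returns the packed tokens and the flat indices they came from.
    """
    seq_len = len(batch[0])
    packed, indices = [], []
    for b, (row, mask_row) in enumerate(zip(batch, key_padding_mask)):
        for t, (tok, keep) in enumerate(zip(row, mask_row)):
            if keep:
                packed.append(tok)
                indices.append(b * seq_len + t)
    return packed, indices

def repad(packed, indices, batch_size, seq_len, fill=None):
    """Scatter packed tokens back into a padded (batch_size, seq_len) layout."""
    flat = [fill] * (batch_size * seq_len)
    for tok, idx in zip(packed, indices):
        flat[idx] = tok
    return [flat[i * seq_len:(i + 1) * seq_len] for i in range(batch_size)]

keys = [["k00", "k01", "k02"], ["k10", "k11", "pad"]]
key_padding_mask = [[True, True, True], [True, True, False]]

packed, indices = unpad(keys, key_padding_mask)
print(packed)   # ['k00', 'k01', 'k02', 'k10', 'k11']
print(indices)  # [0, 1, 2, 3, 4]
```

Because whole positions are dropped, this can only express "ignore this key for every query in the sequence", not a mask that differs per query.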

krejciadam · 2024-10-16T08:31:22Z

> As you can see in the code, key_padding_mask just removes elements from keys and values before passing to the flash attention kernel. There's no attention mask passed to the kernel.

Is there any plan to support key_padding_mask in MHA in v2? My understanding is that this was supported in v1 (in flash_attn.flash_attention.FlashMHA), but in v2 one can only use key_padding_mask when use_flash_attn is False (in flash_attn.modules.mha.MHA). Thank you.

agshar96 · 2024-10-18T07:25:19Z

Hi everyone,
I recently published a paper at the ENLSP Workshop @ NeurIPS 2024 that addresses this problem. The paper can be found here: https://arxiv.org/pdf/2409.15097

I have the code, but it's currently in a private repository, as I am still cleaning it up. If anyone wants access to the repo, just send an email to: [email protected]

Meanwhile, I realised that the PyTorch team has already implemented a change that uses pretty much the same method I used. (I came up with my method independently for a university project; the PyTorch blog post came out around half a month after my project.)

Anyway, TL;DR: PyTorch now supports custom masking for flash attention. You can find it here: https://pytorch.org/blog/flexattention/
(And I am a sad man, as my method will never be used.)
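
As a rough illustration of the FlexAttention approach from the linked blog post, a custom mask is written as a small predicate over `(b, h, q_idx, kv_idx)` that returns `True` where attention is allowed. The sketch below defines a causal sliding-window predicate and evaluates it on plain integers, so it runs without PyTorch. The window size and function name are assumptions made for this example, and the commented-out lines show how such a predicate might be passed to PyTorch 2.5+.

```python
WINDOW = 2  # hypothetical window size, chosen for illustration

def sliding_window_causal(b, h, q_idx, kv_idx):
    """mask_mod-style predicate: True where the query may attend to the key.

    Uses & rather than `and` so the same expression also works elementwise
    on index tensors, which is how FlexAttention evaluates it.
    """
    causal = q_idx >= kv_idx
    in_window = (q_idx - kv_idx) <= WINDOW
    return causal & in_window

# Evaluate on plain integers to inspect the pattern without a GPU.
rows = [
    "".join("x" if sliding_window_causal(0, 0, q, k) else "." for k in range(5))
    for q in range(5)
]
print("\n".join(rows))

# With PyTorch 2.5+ and a CUDA device, the same function could be used as:
#   from torch.nn.attention.flex_attention import create_block_mask, flex_attention
#   block_mask = create_block_mask(sliding_window_causal, None, None, Q_LEN, KV_LEN)
#   out = flex_attention(q, k, v, block_mask=block_mask)
```

Here `Q_LEN`, `KV_LEN`, `q`, `k`, and `v` are placeholders for the caller's sequence lengths and tensors.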

fedebotu mentioned this issue Oct 5, 2023

Reimplementation in RL4CO wouterkool/attention-learn-to-route#58

Open

frank-xwang mentioned this issue Feb 26, 2024

question about Instance-Masked Attention frank-xwang/InstanceDiffusion#6

Closed

clownrat6 mentioned this issue Mar 10, 2024

Flash Attention doesn't support attention mask PKU-YuanGroup/Open-Sora-Plan#109

Closed

xiabingquan mentioned this issue Sep 11, 2024

Data Concatenation: how to avoid sample contamination during training VITA-MLLM/VITA#25

Closed
