Masking + biasing #17
We plan to support additive biases (e.g. ALiBi). The bias should ideally take linear memory and not quadratic memory. Re FP32: we use tensor cores for matrix multiply, which support fp16 (and bf16 in the near future). The xformers team has some implementation of memory-efficient attention in fp32 (here and here).
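For context, here is a minimal sketch of what an ALiBi-style additive bias looks like and why it only needs linear memory: the learnable-free state is one slope per head, and the full quadratic bias is just slope × relative position, so a fused kernel could compute it on the fly. Names and shapes below are illustrative, not the planned API; the slope formula assumes a power-of-two head count as in the ALiBi paper.

```python
import math
import torch

def alibi_slopes(nheads: int) -> torch.Tensor:
    # One scalar slope per head (linear memory), geometric series from the
    # ALiBi paper for power-of-two head counts.
    start = 2 ** (-8 / nheads)
    return torch.tensor([start ** (i + 1) for i in range(nheads)])

def alibi_bias(nheads: int, seqlen: int) -> torch.Tensor:
    # Materialized here only for illustration; a fused kernel would compute
    # slope * (j - i) inside the attention loop instead of storing (H, N, N).
    slopes = alibi_slopes(nheads)                                        # (H,)
    rel = torch.arange(seqlen)[None, :] - torch.arange(seqlen)[:, None]  # (N, N)
    return slopes[:, None, None] * rel[None, :, :]                       # (H, N, N)

# Usage: scores has shape (B, H, N, N) = Q @ K^T / sqrt(d), and the bias is
# simply added before the softmax:
#   attn = torch.softmax(scores + alibi_bias(H, N), dim=-1)
```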
Thanks for the speedy response! I'm thinking of applying FlashAttention to our implementation of AlphaFold 2, which has a number of different attention modules with different biases for the pre-softmax quadratic attention matrix. BTW, is there a reason the existing attn_mask parameter isn't used?
Can you point me to the code / paper on how these biases are applied? Are they just added to QK^T before the softmax?
See, e.g., our implementation of what DeepMind calls "triangular attention"; pseudocode is on page 19 here. It uses both a fixed mask and a trainable bias.
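For reference, the pre-softmax computation in that kind of module boils down to adding two extra terms to the attention logits. A minimal PyTorch sketch (the tensor names mask_bias and triangle_bias are illustrative, not taken from the implementation):

```python
import math
import torch

def biased_attention(q, k, v, mask_bias, triangle_bias):
    """q, k, v: (B, H, N, d).
    mask_bias: large negative values at masked positions, broadcastable to (B, H, N, N).
    triangle_bias: a trainable bias, also broadcastable to (B, H, N, N)."""
    d = q.shape[-1]
    logits = q @ k.transpose(-2, -1) / math.sqrt(d)   # (B, H, N, N)
    logits = logits + mask_bias + triangle_bias       # both terms added pre-softmax
    attn = torch.softmax(logits, dim=-1)
    return attn @ v                                   # (B, H, N, d)
```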
Thanks for the pointers!
In this particular case, the
Sorry, I forgot to ask yesterday: do you have an approximate ETA for the feature? Thanks!
Do you mean a key padding mask (of shape [B, 1, 1, N]), or an arbitrary attention mask of shape [1, 1, N, N]?
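Concretely, both variants just broadcast against the (B, H, N, N) logits. A small illustrative sketch of the two shapes being discussed (standard PyTorch, not this repo's API):

```python
import torch

B, H, N = 2, 4, 16
scores = torch.randn(B, H, N, N)           # Q @ K^T / sqrt(d)

# Key padding mask: one boolean per key position, shape (B, 1, 1, N).
key_padding = torch.ones(B, 1, 1, N, dtype=torch.bool)
key_padding[0, ..., 10:] = False           # e.g. sequence 0 has length 10

# Arbitrary attention mask shared across the batch, shape (1, 1, N, N).
attn_mask = torch.tril(torch.ones(1, 1, N, N, dtype=torch.bool))  # e.g. causal

# Either (or both) can be applied by masking the logits before softmax;
# the combined mask broadcasts to (B, 1, N, N) and then to (B, H, N, N).
masked = scores.masked_fill(~(key_padding & attn_mask), float("-inf"))
probs = torch.softmax(masked, dim=-1)
```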
I tried playing around with the
Oh awesome. Thanks!
We want customizable masking & biasing as well! Adding these two features would make FlashAttention suitable for a lot more models.
Hello @tridao,
Yes, it'll be there eventually. I just haven't had as much bandwidth to work on it recently (conference deadline). By inference, do you mean text generation? Would Q have seqlen 1 most of the time, due to the KV cache?
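For what it's worth, that seqlen-1 query pattern comes from incremental decoding with a KV cache. A rough sketch of the shapes involved (plain PyTorch, not this repo's API):

```python
import math
import torch

def decode_step(q_new, k_new, v_new, k_cache, v_cache):
    """One generation step.
    q_new, k_new, v_new: (B, H, 1, d) for the newly generated token.
    k_cache, v_cache:   (B, H, T, d) for all previously generated tokens."""
    k_cache = torch.cat([k_cache, k_new], dim=2)                # (B, H, T+1, d)
    v_cache = torch.cat([v_cache, v_new], dim=2)
    d = q_new.shape[-1]
    scores = q_new @ k_cache.transpose(-2, -1) / math.sqrt(d)   # (B, H, 1, T+1)
    out = torch.softmax(scores, dim=-1) @ v_cache               # (B, H, 1, d)
    return out, k_cache, v_cache
```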
Refer to #57.
Hi @tridao, congratulations on your great work! I have the same issue when using FlashAttention in Swin Transformer, especially in shifted-window attention. To be more specific, the final attention weights (before softmax) in window attention are the scaled QK^T plus a relative position bias and, for shifted windows, an attention mask.
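A minimal sketch of those pre-softmax terms, roughly following the shapes used in the Swin Transformer reference code; the function name and shapes here are illustrative:

```python
import math
import torch

def window_attention_logits(q, k, rel_pos_bias, shift_mask=None):
    """q, k: (B * num_windows, num_heads, N, d), where N = window_size ** 2.
    rel_pos_bias: (num_heads, N, N), a trainable relative position bias.
    shift_mask: (num_windows, N, N) or None; large negative values block
    attention across shifted-window boundaries."""
    Bw, nH, N, d = q.shape
    attn = q @ k.transpose(-2, -1) / math.sqrt(d)       # (B*nW, nH, N, N)
    attn = attn + rel_pos_bias.unsqueeze(0)             # broadcast over all windows
    if shift_mask is not None:
        nW = shift_mask.shape[0]
        # Reshape so the per-window mask broadcasts over batch and heads.
        attn = attn.view(Bw // nW, nW, nH, N, N) + shift_mask[None, :, None, :, :]
        attn = attn.view(Bw, nH, N, N)
    return torch.softmax(attn, dim=-1)
```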
Refer to #76.
Hello @tridao, flash-attention is amazing! Thank you so much for making it! If possible, I'd also like to request a fully customisable attention bias. Thank you for all of your hard work!
Actually, you can simply use the Triton implementation to achieve this. But for the backward pass you have to modify the original code, and performance will decrease because you have to save the bias gradient in HBM during the backward pass.
I did try this at some point, but I was getting errors (I'm not sure whether it was my code being wrong, or just Triton bugs). I'll probably try again once Triton is more stable.
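In case it helps anyone reading along, the bias path being discussed would look roughly like this. It assumes the Triton implementation in flash_attn/flash_attn_triton.py exposes flash_attn_func with a bias argument broadcastable to (batch, nheads, seqlen_q, seqlen_k); treat the exact import path and signature as assumptions and check the source.

```python
import torch
# Assumed import path / signature -- verify against flash_attn_triton.py.
from flash_attn.flash_attn_triton import flash_attn_func

B, S, H, D = 2, 1024, 8, 64
q = torch.randn(B, S, H, D, device="cuda", dtype=torch.float16, requires_grad=True)
k = torch.randn_like(q, requires_grad=True)
v = torch.randn_like(q, requires_grad=True)

# Additive bias broadcastable to (B, H, S, S). Making it trainable means its
# gradient has to be written back to HBM in the backward pass, which is the
# performance cost mentioned above, and the stock backward may need modifying.
bias = torch.randn(1, H, S, S, device="cuda", dtype=torch.float16, requires_grad=True)

out = flash_attn_func(q, k, v, bias)   # (B, S, H, D)
out.sum().backward()                   # bias.grad only if the backward supports it
```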
Where can we find the example?
You can look at our BERT implementation.
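To spell out the padding pattern the BERT implementation relies on: a key padding mask is converted to variable-length (unpadded) form before calling the attention kernel. The helpers below are from flash_attn/bert_padding.py; treat the exact signatures as approximate and check that file.

```python
import torch
# From this repo; see flash_attn/bert_padding.py for the exact signatures.
from flash_attn.bert_padding import unpad_input, pad_input

B, S, Hdim = 4, 128, 768
hidden = torch.randn(B, S, Hdim, device="cuda", dtype=torch.float16)

# Key padding mask: True for real tokens, False for padding. Only trailing
# padding is supported -- as noted below, a mask in the middle of a sequence
# is not handled by this path.
attention_mask = torch.zeros(B, S, dtype=torch.bool, device="cuda")
attention_mask[:, :100] = True

hidden_unpad, indices, cu_seqlens, max_seqlen = unpad_input(hidden, attention_mask)
# hidden_unpad: (total_tokens, Hdim); cu_seqlens: (B + 1,) cumulative sequence
# lengths; max_seqlen: longest sequence in the batch. These feed the
# unpadded/varlen FlashAttention functions. Afterwards, scatter back:
hidden_repad = pad_input(hidden_unpad, indices, B, S)   # (B, S, Hdim)
```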
If we convert a sparse mask to cu_seqlens and max_seqlen using unpad_input, we get incorrect results. How should we solve this problem? Thanks!
If the mask is in the middle of the sequence, that's not supported right now.
Can the
How difficult would it be to modify the kernels to support arbitrary masks/additive biases for the attention logits, especially w/ support for tensor broadcasting? Is there any fundamental reason why that wouldn't work? I noticed that the FlashAttention class has an "attn_mask" parameter, but it doesn't let you specify it.
On an unrelated note, would it be worth adding FP32 support for inference?
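For concreteness, here is a reference (non-fused) version of the semantics being asked for: a boolean mask and/or additive bias, both broadcastable against the attention logits. This is just standard attention in PyTorch, not FlashAttention's API.

```python
import math
import torch

def reference_attention(q, k, v, attn_mask=None, attn_bias=None):
    """q, k, v: (B, H, N, d).
    attn_mask: boolean, broadcastable to (B, H, N, N); True = keep.
    attn_bias: float, broadcastable to (B, H, N, N); added to the logits.
    A fused kernel would apply both inside the softmax loop instead of
    materializing the (B, H, N, N) logits."""
    d = q.shape[-1]
    logits = q @ k.transpose(-2, -1) / math.sqrt(d)
    if attn_bias is not None:
        logits = logits + attn_bias
    if attn_mask is not None:
        logits = logits.masked_fill(~attn_mask, float("-inf"))
    return torch.softmax(logits, dim=-1) @ v
```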