After switching to Flash Attention, why do we not mask the attention? Wouldn't it behave like an encoder? #65
We register a masking bias when coding our attention block to prevent tokens from attending to future positions (so the block acts as a decoder). However, when we switch to using Flash Attention,

torch.nn.functional.scaled_dot_product_attention(q, k, v)

we do not pass anything to the attn_mask argument, which defaults to None. Won't this make it behave like an encoder now? Am I missing something, or has nobody noticed it?

Replies: 1 comment

Got it! The is_causal=True argument takes care of the causal masking.
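For anyone hitting the same question, here is a minimal, self-contained sketch (tensor names and shapes are illustrative, not the repo's code) comparing a hand-rolled attention that uses a registered lower-triangular "bias" mask against `F.scaled_dot_product_attention(q, k, v, is_causal=True)`. The two paths should produce the same output, because `is_causal=True` applies the same lower-triangular mask inside the fused kernel:

```python
import math
import torch
import torch.nn.functional as F

# Illustrative shapes: (batch, heads, sequence length, head dim)
B, nh, T, hs = 2, 4, 8, 16
q = torch.randn(B, nh, T, hs)
k = torch.randn(B, nh, T, hs)
v = torch.randn(B, nh, T, hs)

# Manual path: the registered "masking bias" described above, i.e. a
# lower-triangular buffer used to blank out attention to future positions.
bias = torch.tril(torch.ones(T, T)).view(1, 1, T, T)
att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
att = att.masked_fill(bias[:, :, :T, :T] == 0, float('-inf'))
att = F.softmax(att, dim=-1)
y_manual = att @ v

# Flash Attention path: no attn_mask is passed, but is_causal=True tells the
# fused kernel to apply the same lower-triangular mask internally.
y_flash = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(torch.allclose(y_manual, y_flash, atol=1e-5))  # should print True
```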
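And to see concretely what the defaults do (again just an illustrative sketch, not code from the repo): calling `scaled_dot_product_attention(q, k, v)` with `attn_mask=None` and `is_causal=False` really is unmasked, encoder-style attention, which is exactly why the `is_causal` flag matters:

```python
import torch
import torch.nn.functional as F

B, nh, T, hs = 1, 2, 6, 8
q, k, v = (torch.randn(B, nh, T, hs) for _ in range(3))

# Defaults (attn_mask=None, is_causal=False): every position attends to every
# other position, i.e. bidirectional, encoder-style attention.
y_bidir = F.scaled_dot_product_attention(q, k, v)

# is_causal=True: each position attends only to itself and earlier positions.
y_causal = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Early positions "see the future" in the bidirectional call, so their outputs
# differ from the causal ones (almost surely, for random inputs)...
print(torch.allclose(y_bidir[:, :, 0], y_causal[:, :, 0]))               # False
# ...while the last position attends to the whole sequence in both calls.
print(torch.allclose(y_bidir[:, :, -1], y_causal[:, :, -1], atol=1e-5))  # True
```

Only the last position matches between the two calls, because it is the only position that attends to the full sequence either way.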