After switching to Flash Attention, why do we not mask the attention? Wouldn't it behave like an encoder? #65
We register a masking bias when coding our attention block to prevent tokens from attending to future positions (so the block acts as a decoder). However, when we switch to using Flash Attention,

torch.nn.functional.scaled_dot_product_attention(q, k, v)

we do not pass anything to the attn_mask argument, which defaults to None. Won't this make it behave like an encoder now? Am I missing something, or has nobody noticed it?

Replies: 1 comment

Got it! The is_causal=True argument takes care of the causal masking.
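For anyone hitting the same question, here is a minimal, self-contained sketch (tensor names and shapes are illustrative, not the repo's code) comparing a hand-rolled attention that uses a registered lower-triangular "bias" mask against `F.scaled_dot_product_attention(q, k, v, is_causal=True)`. The two paths should produce the same output, because `is_causal=True` applies the same lower-triangular mask inside the fused kernel:

```python
import math
import torch
import torch.nn.functional as F

# Illustrative shapes: (batch, heads, sequence length, head dim)
B, nh, T, hs = 2, 4, 8, 16
q = torch.randn(B, nh, T, hs)
k = torch.randn(B, nh, T, hs)
v = torch.randn(B, nh, T, hs)

# Manual path: the registered "masking bias" described above, i.e. a
# lower-triangular buffer used to blank out attention to future positions.
bias = torch.tril(torch.ones(T, T)).view(1, 1, T, T)
att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
att = att.masked_fill(bias[:, :, :T, :T] == 0, float('-inf'))
att = F.softmax(att, dim=-1)
y_manual = att @ v

# Flash Attention path: no attn_mask is passed, but is_causal=True tells the
# fused kernel to apply the same lower-triangular mask internally.
y_flash = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(torch.allclose(y_manual, y_flash, atol=1e-5))  # should print True
```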
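And to see concretely what the defaults do (again just an illustrative sketch, not code from the repo): calling `scaled_dot_product_attention(q, k, v)` with `attn_mask=None` and `is_causal=False` really is unmasked, encoder-style attention, which is exactly why the `is_causal` flag matters:

```python
import torch
import torch.nn.functional as F

B, nh, T, hs = 1, 2, 6, 8
q, k, v = (torch.randn(B, nh, T, hs) for _ in range(3))

# Defaults (attn_mask=None, is_causal=False): every position attends to every
# other position, i.e. bidirectional, encoder-style attention.
y_bidir = F.scaled_dot_product_attention(q, k, v)

# is_causal=True: each position attends only to itself and earlier positions.
y_causal = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Early positions "see the future" in the bidirectional call, so their outputs
# differ from the causal ones (almost surely, for random inputs)...
print(torch.allclose(y_bidir[:, :, 0], y_causal[:, :, 0]))               # False
# ...while the last position attends to the whole sequence in both calls.
print(torch.allclose(y_bidir[:, :, -1], y_causal[:, :, -1], atol=1e-5))  # True
```

Only the last position matches between the two calls, because it is the only position that attends to the full sequence either way.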