Should I be getting more speedup/memory reduction from FlashAttention2 with Mistral? #27329
Comments
Hi @cassianlewis
Hi @younesbelkada, thanks for the reply.
Results: slight improvements in time/memory. FYI, the input is ~2k tokens (but no padding), so this may benefit more from longer input sequences. I then tested it with [...]. This is more in line with what you posted in #26464 (comment). So it looks like, for decoding at least, the speedups are fairly minimal for the kind of input sequence lengths/batch sizes I'm using.
Thanks a lot for this benchmark! Yes, I think this result is pretty much in line with my findings.
System Info
transformers: 4.35.0
python: 3.9.13
Who can help?
@SunMarc
@younesbelkada
@gante
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Setup model
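The original setup snippet was not captured in this page; below is a minimal sketch of the kind of setup being described, assuming the mistralai/Mistral-7B-v0.1 checkpoint and the use_flash_attention_2 flag available in transformers 4.35 (newer releases use attn_implementation="flash_attention_2" instead):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.1"  # assumed checkpoint; the issue does not name it

tokenizer = AutoTokenizer.from_pretrained(model_id)

# FlashAttention-2 requires fp16/bf16 weights on a supported GPU.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    use_flash_attention_2=True,  # transformers 4.35 API; >=4.36 uses attn_implementation="flash_attention_2"
    device_map="auto",
)
```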
Run code for different batch sizes
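The benchmarking code itself is also missing from this page; here is a hedged sketch of how generation time and peak memory might be compared across batch sizes, continuing from the setup sketch above (the prompt length, batch sizes, and generation length are assumptions, not values from the issue):

```python
import time

import torch

prompt = "some long input of roughly 2k tokens ..."  # placeholder; the real prompt is not in the issue

for batch_size in [1, 2, 4, 8]:
    inputs = tokenizer([prompt] * batch_size, return_tensors="pt").to(model.device)

    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.perf_counter()

    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=256, do_sample=False)

    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    new_tokens = out.shape[1] - inputs["input_ids"].shape[1]

    print(
        f"bs={batch_size}: {elapsed:.2f}s total, "
        f"{batch_size * new_tokens / elapsed:.1f} tok/s, peak {peak_gb:.1f} GB"
    )
```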
Expected behavior
Results
Very little speedup/memory improvement:
Profiling
With FA2:
Without FA2:
I would expect better performance given these profiles.
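The profiler screenshots did not survive extraction. For reference, a minimal sketch of one way such a profile could be reproduced with torch.profiler (the issue does not state which profiler was actually used); `model` and `inputs` are assumed to come from the snippets above:

```python
import torch
from torch.profiler import ProfilerActivity, profile

# Profile a single short generate() call.
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    profile_memory=True,
    record_shapes=True,
) as prof:
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=32, do_sample=False)

# Top ops by CUDA time; with FA2 enabled, attention should show up as
# fused flash-attention kernels rather than separate matmul + softmax ops.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```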