[Performance]: Unified flashattn kernel not outperforming current one #10707

Open
1 task done
NickLucche opened this issue Nov 27, 2024 · 0 comments
Labels
performance Performance-related issues

Comments

@NickLucche
Contributor

Proposal to improve performance

While working on #9291, I experimented with processing prefills and decodes in a single forward call (through the flash_attn_varlen_func API). Currently we handle the two separately by "splitting" the flattened 1D token tensor (size n_prefills+n_decodes).
The unification is meaningful when chunked prefill is enabled, since that is what allows mixed prefill-decode batches to be scheduled.
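
To make the two layouts concrete, here is a minimal pure-Python sketch of the metadata involved. It is an illustration under assumptions, not the code from #9291: the helper names (`cumulative_seqlens`, `build_unified_metadata`, `split_point`) are hypothetical, and the only thing taken from flash-attn is that `flash_attn_varlen_func` consumes cumulative sequence offsets (`cu_seqlens_q`, `cu_seqlens_k`) plus `max_seqlen_q` and `max_seqlen_k`. It also assumes prefills are ordered before decodes in the flattened tensor and that any request with more than one new token is a prefill.

```python
from itertools import accumulate


def cumulative_seqlens(lengths):
    """Return [0, l0, l0 + l1, ...]: per-request offsets into a flattened batch."""
    return [0] + list(accumulate(lengths))


def build_unified_metadata(query_lens, context_lens):
    """Hypothetical helper: metadata for one varlen call over a mixed batch.

    query_lens[i]   -- new tokens of request i in this step (1 for a decode)
    context_lens[i] -- tokens of request i already present in the KV cache
    """
    seq_lens = [q + c for q, c in zip(query_lens, context_lens)]
    return {
        "cu_seqlens_q": cumulative_seqlens(query_lens),
        "cu_seqlens_k": cumulative_seqlens(seq_lens),
        "max_seqlen_q": max(query_lens),
        "max_seqlen_k": max(seq_lens),
    }


def split_point(query_lens):
    """Hypothetical helper: token index where the current path would split.

    Assumes prefills come first and have more than one new token each.
    """
    return sum(q for q in query_lens if q > 1)


# Two (chunked) prefills followed by two decodes.
query_lens = [5, 3, 1, 1]
context_lens = [0, 4, 10, 7]

meta = build_unified_metadata(query_lens, context_lens)
num_prefill_tokens = split_point(query_lens)

print(meta)
print("prefill tokens:", num_prefill_tokens, "decode tokens:", sum(query_lens) - num_prefill_tokens)
```

In the split path, tokens `[0, num_prefill_tokens)` would go to one kernel call and the rest to another; in the unified path, the whole flattened tensor goes to a single call described by `meta`, with decodes simply appearing as sequences of query length 1.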

Following the change, @sroy745 found no speedup in his benchmarks with the new version using a single kernel call, which is quite baffling.

I believe we should give the fused version another try in a separate PR and investigate why the expected speedup did not materialize, since in theory this should be low-hanging fruit for performance optimization.

The plan would be to rebase the changes introduced prior to this commit 2a9d8f1#diff-c310ada35beeefacf4f019051ceaffeb471117d5d5b8be51610d80c7632c6bdcL657-L678, benchmark performance again to establish a baseline, and take it from there. There is no need to focus on spec decoding from the start, since we should observe a boost on regular chunked prefill too (just enable it).
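
For the baseline step, a rough timing harness along these lines could be used to compare the two paths. This is only a sketch: `bench`, `split_path` and `unified_path` are hypothetical names, and the two workloads are CPU stand-ins rather than the real attention calls. The `sync` hook is there because GPU kernels launch asynchronously, so a real measurement would need to pass a device synchronization function before reading the clock.

```python
import statistics
import time


def bench(fn, warmup=3, iters=20, sync=lambda: None):
    """Return the median wall-clock time of fn() in seconds.

    sync is called before each clock read; for GPU work it should block
    until the device is idle, otherwise only launch overhead is measured.
    """
    for _ in range(warmup):
        fn()
    sync()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        sync()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Stand-in workloads (assumptions): two smaller calls vs. one larger call.
def split_path():
    sum(range(1000))
    sum(range(1000))


def unified_path():
    sum(range(2000))


baseline = bench(split_path)
candidate = bench(unified_path)
print(f"split: {baseline * 1e6:.1f} us, unified: {candidate * 1e6:.1f} us")
```

Using the median over several iterations after a warmup keeps one-off costs (compilation, cache misses, allocator growth) from dominating the comparison.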

@NickLucche NickLucche added the performance Performance-related issues label Nov 27, 2024