Proposal to improve performance
While working on #9291, I experimented with unifying prefill and decode processing in a single forward call (through the flash_attn_varlen_func API); currently we separate the two by "splitting" the flattened 1D tokens tensor (of size n_prefills+n_decodes).
The unification is meaningful when chunked prefill is enabled, as it allows mixed prefill-decode batches to be scheduled.
Following the change, @sroy745 found no speedup in his benchmarks with the new version using a single kernel call, which is quite baffling.
I believe we should give the fused version another try in a separate PR, investigating why the expected speedup does not materialize, as in theory this should be low-hanging fruit in terms of performance optimization.
The plan would be to rebase the changes introduced prior to commit 2a9d8f1#diff-c310ada35beeefacf4f019051ceaffeb471117d5d5b8be51610d80c7632c6bdcL657-L678 and benchmark performance once again to set the baseline, then take it from there. There is no need to focus on spec decoding from the start, since we should observe a boost with regular chunked prefill too (just enable it).
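To make the idea concrete, here is a minimal sketch of the metadata a unified call would need: prefills contribute all their tokens as queries, while each decode contributes exactly one query token, so the two can share one flattened tensor and one set of cumulative sequence lengths. `build_unified_metadata` is a hypothetical helper written for illustration, not vLLM's actual API.

```python
# Hedged sketch (not vLLM code): building cu_seqlens for a mixed
# prefill+decode batch so that a single flash_attn_varlen_func call
# could cover both, instead of two separate kernel invocations.

def build_unified_metadata(prefill_lens, num_decodes):
    """Return (cu_seqlens_q, max_seqlen_q) for a mixed batch.

    Prefill sequences contribute all their tokens as queries; each
    decode sequence contributes exactly one new query token.
    """
    query_lens = list(prefill_lens) + [1] * num_decodes
    cu_seqlens_q = [0]
    for n in query_lens:
        cu_seqlens_q.append(cu_seqlens_q[-1] + n)
    max_seqlen_q = max(query_lens) if query_lens else 0
    return cu_seqlens_q, max_seqlen_q


if __name__ == "__main__":
    # Two prefills (5 and 3 tokens) batched with two decodes (1 token
    # each): the flattened 1D tokens tensor has 5 + 3 + 1 + 1 = 10 entries.
    cu_seqlens_q, max_seqlen_q = build_unified_metadata([5, 3], num_decodes=2)
    print(cu_seqlens_q, max_seqlen_q)  # [0, 5, 8, 9, 10] 5
    # With q/k/v packed along dim 0, one kernel call would then suffice:
    #   flash_attn_varlen_func(q, k, v, cu_seqlens_q, cu_seqlens_k,
    #                          max_seqlen_q, max_seqlen_k, causal=True)
```

The decode entries are just length-1 queries at the tail of the batch, which is what makes scheduling mixed batches under chunked prefill straightforward; whether the single fused call is actually faster is exactly what the benchmarks above would need to settle.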
Report of performance regression
No response
Misc discussion on performance
No response
Your current environment (if you think it is necessary)
No response
Before submitting a new issue...
Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.