Proposal to improve performance
While working on #9291, I experimented with unifying prefill and decode processing in a single forward call (through the flash_attn_varlen_func API); currently we separate the two by "splitting" the flattened 1D tokens tensor (of size n_prefills+n_decodes).
The unification is meaningful when chunked prefill is enabled, as it allows mixed prefill-decode batches to be scheduled.
Following the change, @sroy745 found no speedup in his benchmarks with the new version using a single kernel call, which is quite baffling.
I believe we should give the fused version another try in a separate PR, investigating why the expected speedup does not materialize, as in theory this should be low-hanging fruit in terms of performance optimization.
The plan would be to rebase the changes introduced prior to commit 2a9d8f1#diff-c310ada35beeefacf4f019051ceaffeb471117d5d5b8be51610d80c7632c6bdcL657-L678 and benchmark performance once again to set the baseline, then take it from there. There is no need to focus on spec decoding from the start, since we should observe a boost with regular chunked prefill too (just enable it).
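To make the idea concrete, here is a minimal sketch of the metadata a unified call would need: prefills contribute all their tokens as queries, while each decode contributes exactly one query token, so the two can share one flattened tensor and one set of cumulative sequence lengths. `build_unified_metadata` is a hypothetical helper written for illustration, not vLLM's actual API.

```python
# Hedged sketch (not vLLM code): building cu_seqlens for a mixed
# prefill+decode batch so that a single flash_attn_varlen_func call
# could cover both, instead of two separate kernel invocations.

def build_unified_metadata(prefill_lens, num_decodes):
    """Return (cu_seqlens_q, max_seqlen_q) for a mixed batch.

    Prefill sequences contribute all their tokens as queries; each
    decode sequence contributes exactly one new query token.
    """
    query_lens = list(prefill_lens) + [1] * num_decodes
    cu_seqlens_q = [0]
    for n in query_lens:
        cu_seqlens_q.append(cu_seqlens_q[-1] + n)
    max_seqlen_q = max(query_lens) if query_lens else 0
    return cu_seqlens_q, max_seqlen_q


if __name__ == "__main__":
    # Two prefills (5 and 3 tokens) batched with two decodes (1 token
    # each): the flattened 1D tokens tensor has 5 + 3 + 1 + 1 = 10 entries.
    cu_seqlens_q, max_seqlen_q = build_unified_metadata([5, 3], num_decodes=2)
    print(cu_seqlens_q, max_seqlen_q)  # [0, 5, 8, 9, 10] 5
    # With q/k/v packed along dim 0, one kernel call would then suffice:
    #   flash_attn_varlen_func(q, k, v, cu_seqlens_q, cu_seqlens_k,
    #                          max_seqlen_q, max_seqlen_k, causal=True)
```

The decode entries are just length-1 queries at the tail of the batch, which is what makes scheduling mixed batches under chunked prefill straightforward; whether the single fused call is actually faster is exactly what the benchmarks above would need to settle.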
Report of performance regression
No response
Misc discussion on performance
No response
Your current environment (if you think it is necessary)
No response
Before submitting a new issue...
Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.