
[WIP] Caching of tensors for decode (flash-attn) #7206

Draft
wants to merge 2 commits into main
Conversation

alexm-neuralmagic (Collaborator)
This PR introduces a TensorCache that caches tensors between successive iterations of the scheduler/prepare_inputs. The effect is similar to that of the multi-step scheduler, but this approach keeps single-step behavior at the cost of some of the performance gain (a rough sketch of the idea follows the TODO list below).

TODO:

  1. Generalize the caching to more complicated cases than the simple +1 decode update
  2. More benchmarks (it looks like this needs to be combined with e2e overhead reductions to see more speedup; in particular, it works better when combined with this PR: [Performance] Optimize e2e overheads: Reduce python allocations #7162)
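To make the idea concrete, here is a minimal sketch of what such a cache could look like. The class name `TensorCache` comes from the PR description, but the methods, signatures, and the in-place `+1` update below are illustrative assumptions, not the PR's actual implementation.

```python
# Hypothetical sketch (not the PR's actual TensorCache implementation):
# keep decode-input tensors alive across scheduler / prepare_inputs
# iterations and update them in place instead of rebuilding them from
# Python lists every step.
from typing import Callable, Dict

import torch


class TensorCache:
    """Caches per-step decode tensors keyed by name."""

    def __init__(self) -> None:
        self._cache: Dict[str, torch.Tensor] = {}

    def get_or_build(self, name: str,
                     build_fn: Callable[[], torch.Tensor]) -> torch.Tensor:
        # Build the tensor once (the slow Python path), then reuse it.
        if name not in self._cache:
            self._cache[name] = build_fn()
        return self._cache[name]

    def advance_decode(self, name: str, delta: int = 1) -> None:
        # The "simple +1" decode case: every running sequence advances by
        # exactly `delta` tokens, so an in-place add is all that is needed.
        self._cache[name].add_(delta)

    def invalidate(self) -> None:
        # Drop everything, e.g. when the batch composition changes
        # (new prefills, preemptions, finished sequences).
        self._cache.clear()


if __name__ == "__main__":
    cache = TensorCache()
    # Iteration 0: build positions from Python lists.
    positions = cache.get_or_build(
        "positions", lambda: torch.tensor([5, 9, 12], dtype=torch.long))
    # Iteration 1+: just bump the cached tensor instead of rebuilding it.
    cache.advance_decode("positions")
    print(positions)  # tensor([ 6, 10, 13])
```

The point of the sketch is the trade-off described above: tensors survive across iterations, so the per-step Python work shrinks to an in-place add, while anything that breaks the simple +1 pattern has to invalidate the cache and fall back to the full rebuild.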


github-actions bot commented Aug 6, 2024

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build in the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

alexm-neuralmagic (Collaborator, Author)

I did a first round of benchmarks, and I can only see speedups when this PR is combined with #7162. I will investigate this more; it seems we need to reduce the Python bottlenecks first to actually see speedups from tensor caching.

comaniac (Collaborator) commented Aug 6, 2024

Does that mean the speedup from this optimization is pretty marginal? I'm still debating between code simplicity and efficiency, so if we can't see a stable speedup (even if it's just 2-3%), we should reconsider whether to introduce this.

@alexm-neuralmagic alexm-neuralmagic marked this pull request as draft August 6, 2024 15:47
@alexm-neuralmagic alexm-neuralmagic changed the title Caching of tensors for decode (flash-attn) [WIP] Caching of tensors for decode (flash-attn) Aug 6, 2024
alexm-neuralmagic (Collaborator, Author)

@comaniac you're right, I still need some time to work on this to understand what's going on. It looks like we first need to finish #7162, and then I can investigate this further.

alexm-neuralmagic (Collaborator, Author)

Converted to draft to indicate WIP
