
[WIP] Caching of tensors for decode (flash-attn) #7206

Draft
wants to merge 2 commits into main
Conversation

alexm-neuralmagic (Collaborator)
This PR introduces a TensorCache that caches tensors between successive iterations of the scheduler/prepare_inputs. The effect is similar to that of the multi-step scheduler, but this approach keeps single-step behavior at the cost of some of the performance gain (a rough sketch of the idea follows the TODO list below).

TODO:

  1. Generalize the caching to more complicated cases than the simple +1 decode update
  2. More benchmarks (it looks like this needs to be combined with e2e overhead reductions to see more speedup; in particular, it works better when combined with this PR: [Performance] Optimize e2e overheads: Reduce python allocations #7162)
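To make the idea concrete, here is a minimal sketch of what such a cache could look like. The class name `TensorCache` comes from the PR description, but the methods, signatures, and the in-place `+1` update below are illustrative assumptions, not the PR's actual implementation.

```python
# Hypothetical sketch (not the PR's actual TensorCache implementation):
# keep decode-input tensors alive across scheduler / prepare_inputs
# iterations and update them in place instead of rebuilding them from
# Python lists every step.
from typing import Callable, Dict

import torch


class TensorCache:
    """Caches per-step decode tensors keyed by name."""

    def __init__(self) -> None:
        self._cache: Dict[str, torch.Tensor] = {}

    def get_or_build(self, name: str,
                     build_fn: Callable[[], torch.Tensor]) -> torch.Tensor:
        # Build the tensor once (the slow Python path), then reuse it.
        if name not in self._cache:
            self._cache[name] = build_fn()
        return self._cache[name]

    def advance_decode(self, name: str, delta: int = 1) -> None:
        # The "simple +1" decode case: every running sequence advances by
        # exactly `delta` tokens, so an in-place add is all that is needed.
        self._cache[name].add_(delta)

    def invalidate(self) -> None:
        # Drop everything, e.g. when the batch composition changes
        # (new prefills, preemptions, finished sequences).
        self._cache.clear()


if __name__ == "__main__":
    cache = TensorCache()
    # Iteration 0: build positions from Python lists.
    positions = cache.get_or_build(
        "positions", lambda: torch.tensor([5, 9, 12], dtype=torch.long))
    # Iteration 1+: just bump the cached tensor instead of rebuilding it.
    cache.advance_decode("positions")
    print(positions)  # tensor([ 6, 10, 13])
```

The point of the sketch is the trade-off described above: tensors survive across iterations, so the per-step Python work shrinks to an in-place add, while anything that breaks the simple +1 pattern has to invalidate the cache and fall back to the full rebuild.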


github-actions bot commented Aug 6, 2024

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build in the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

alexm-neuralmagic (Collaborator, Author)

I did a first round of benchmarks, and I can only see speedups when this PR is combined with #7162. I will investigate this more; it seems we need to reduce the Python bottlenecks first to actually see speedups from tensor caching.

comaniac (Collaborator) commented Aug 6, 2024

Does that mean the speedup from this optimization is pretty marginal? I'm still debating between code simplicity and efficiency, so if we can't see a stable speedup (even if it's just 2-3%), we should reconsider whether to introduce this.

@alexm-neuralmagic alexm-neuralmagic marked this pull request as draft August 6, 2024 15:47
@alexm-neuralmagic alexm-neuralmagic changed the title Caching of tensors for decode (flash-attn) [WIP] Caching of tensors for decode (flash-attn) Aug 6, 2024
alexm-neuralmagic (Collaborator, Author)

@comaniac you're right, I still need some time to work on this to understand what's going on. It looks like we first need to finish #7162, and then I can investigate this further.

alexm-neuralmagic (Collaborator, Author)

Converted to draft to indicate WIP
