[WIP] Caching of tensors for decode (flash-attn) #7206
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge). To run full CI, you can do one of these:
I ran a first round of benchmarks, and I only see speedups when this PR is combined with #7162. I will investigate this more. It seems we need to reduce the Python bottlenecks first to actually see speedups from tensor caching.
Does that mean the speedup from this optimization is pretty marginal? I'm still debating between code simplicity and efficiency, so if we can't see a stable speedup (even if it's just 2-3%), we should consider whether to introduce this.
Converted to draft to indicate WIP.
This PR introduces TensorCache to cache tensors between successive iterations of the scheduler/prepare_inputs. This is similar in effect to the multi-step scheduler; however, this approach maintains single-step behavior at the expense of some performance.
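A minimal sketch of the buffer-reuse idea, assuming a PyTorch-style in-place copy into pre-allocated buffers; the class and method names below are illustrative and are not taken from this PR's actual implementation:

```python
# Hypothetical sketch of the tensor-caching idea; TensorCache, get_or_create,
# and fill are illustrative names, not the PR's API.
from typing import Dict, Tuple

import torch


class TensorCache:
    """Reuses pre-allocated tensors across prepare_inputs calls.

    Instead of allocating fresh metadata tensors (slot mappings, sequence
    lengths, block tables, ...) on every decode step, a cached buffer is
    (re)allocated lazily and filled in place.
    """

    def __init__(self, device: torch.device):
        self.device = device
        self._buffers: Dict[str, torch.Tensor] = {}

    def get_or_create(self, key: str, shape: Tuple[int, ...],
                      dtype: torch.dtype) -> torch.Tensor:
        buf = self._buffers.get(key)
        if buf is None or buf.shape != torch.Size(shape) or buf.dtype != dtype:
            # Allocate a new buffer only when the shape or dtype changes.
            buf = torch.empty(shape, dtype=dtype, device=self.device)
            self._buffers[key] = buf
        return buf

    def fill(self, key: str, values: torch.Tensor) -> torch.Tensor:
        # Copy this step's values into the cached buffer in place,
        # avoiding a fresh device allocation on every decode iteration.
        buf = self.get_or_create(key, tuple(values.shape), values.dtype)
        buf.copy_(values, non_blocking=True)
        return buf
```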
TODO: