
[3rdparty] Bump FlashInfer #17236

Merged

Conversation

@MasterJH5574 (Contributor) commented Aug 2, 2024

This PR bumps FlashInfer and updates PagedKVCache accordingly for performance improvement.

Some notes on this bump:

  • When the Grouped-Query Attention group size is at least 4 and FlashInfer is enabled, we use the prefill attention kernel for better performance.
  • We enlarge the temporary workspace for FlashInfer accordingly, since the current version of FlashInfer may consume a much larger workspace. We skip allocating the workspace when FlashInfer is not enabled.
  • We reduce the maximum block depth to 2, since we observe that cascade inference offers limited benefit when the batch size is small and prompt reuse is low.
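The kernel-selection heuristic in the first note can be sketched as follows. This is a minimal illustration of the described rule, not the actual TVM/FlashInfer code; the function and parameter names are hypothetical.

```python
# Hypothetical sketch of the kernel-selection heuristic described in the PR
# notes; names are illustrative, not the real TVM/FlashInfer API.

def choose_attention_kernel(num_qo_heads: int, num_kv_heads: int,
                            flashinfer_enabled: bool) -> str:
    """Pick an attention kernel based on the GQA group size.

    The GQA group size is the number of query heads sharing each KV head.
    Per the PR notes, when the group size is at least 4 and FlashInfer is
    enabled, the prefill kernel performs better than the decode kernel.
    """
    group_size = num_qo_heads // num_kv_heads
    if flashinfer_enabled and group_size >= 4:
        return "prefill"
    return "decode"

# Example: 32 query heads over 8 KV heads gives a group size of 4.
assert choose_attention_kernel(32, 8, True) == "prefill"
assert choose_attention_kernel(32, 32, True) == "decode"   # group size 1
assert choose_attention_kernel(32, 8, False) == "decode"   # FlashInfer off
```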

@MasterJH5574 MasterJH5574 force-pushed the tvm-dev/2024-08-02-bump-flashinfer branch from d695af4 to e6987df Compare August 2, 2024 21:45
@tqchen tqchen merged commit 76b954a into apache:main Aug 3, 2024
20 checks passed