[NEW] GEMV kernel implementation #40
Merged
This PR is ready. The last things needed are a speedup on the MPT and Falcon models and thorough testing of quantization across all models.
The speed test for MPT is now correct and matches TinyChat. However, generation is still random and does not produce coherent output. EDIT: Finally working!! RTX 3090, MPT 7B.
Overall, this PR is about 60% faster than the main branch with the new GEMV and FasterTransformer kernels. When both GEMM and GEMV use the same FasterTransformer kernels, GEMV is about 20-25% faster than GEMM at token generation; however, GEMV is roughly 10x slower at processing context.
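For context, the kernel is chosen at quantization time. Below is a minimal sketch of quantizing a model for the GEMV kernel with AutoAWQ's `quant_config`; the `version` field is what selects between GEMM and GEMV, and the base model and output directory are placeholders, not paths from this PR.

```python
# Minimal sketch: quantizing a model for the GEMV kernel with AutoAWQ.
# The "version" field selects between the GEMM and GEMV kernels; the
# model path and output directory below are placeholders.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "lmsys/vicuna-7b-v1.5"     # placeholder base model
quant_path = "vicuna-7b-v1.5-awq-gemv"  # placeholder output directory

quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMV",  # use "GEMM" instead for faster context processing
}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)  # AWQ calibration + weight packing
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```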
Changes:

- Split the quantized linear module into `WQLinear_GEMM` and `WQLinear_GEMV`, and moved `qmodule` into the `modules` directory.
- ~~Maintain backward compatibility with the GEMM kernel for a few versions of AutoAWQ.~~ Instead, maintain two versions: GEMM is currently optimized to process context 10x faster than GEMV, at the cost of roughly 20% lower token-generation speed.

TODO:

- ~~Implement a fused model for Falcon 40B.~~ Do this in a separate PR (maybe the MQA/GQA PR); it seems a large change might be required. Inspiration.
- Implement better allocation of `max_seq_len` to reduce VRAM usage, and create a user guide for setting `max_seq_len` correctly (512 tokens ≈ 3.89 GB VRAM, 8192 tokens ≈ 9 GB VRAM); see the sketch after this list.
- ~~Test if we need to reset the KV cache and `start_pos` after generating the full context of a model.~~ Not handled for now; can be worked around by setting `max_new_tokens`.
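To make the `max_seq_len` guidance above concrete, here is a hypothetical back-of-the-envelope estimate of how a preallocated KV cache grows with `max_seq_len`. The model dimensions assume a LLaMA-7B-like architecture and an fp16 cache and are not taken from this PR; it also counts only the raw cache, whereas the VRAM figures above include model weights and other buffers.

```python
# Hypothetical estimate of KV-cache VRAM as a function of max_seq_len.
# Assumes a LLaMA-7B-like shape (32 layers, 32 KV heads, head_dim 128)
# and an fp16 cache; the real fused modules may allocate differently.
def kv_cache_gib(max_seq_len: int,
                 n_layers: int = 32,
                 n_kv_heads: int = 32,
                 head_dim: int = 128,
                 batch_size: int = 1,
                 dtype_bytes: int = 2) -> float:
    # 2x for the separate key and value tensors in every layer.
    total_bytes = (2 * n_layers * batch_size * n_kv_heads
                   * head_dim * max_seq_len * dtype_bytes)
    return total_bytes / 1024**3

for seq_len in (512, 2048, 8192):
    print(f"max_seq_len={seq_len:5d} -> ~{kv_cache_gib(seq_len):.2f} GiB of KV cache")
```

The growth is linear in `max_seq_len`, which is why preallocating for 8192 tokens costs several extra gigabytes even when prompts are short.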
GEMV Benchmark
`python examples/benchmark.py --model_path casperhansen/vicuna-7b-v1.5-awq-gemv`
GEMM Benchmark
`python examples/benchmark.py --model_path casperhansen/vicuna-7b-v1.5-awq`
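The two benchmark checkpoints presumably differ only in the kernel they were quantized for (`"GEMV"` vs. `"GEMM"` in `quant_config`). Below is a sketch of loading one of them for generation, assuming AutoAWQ's `from_quantized` entry point; its exact signature has changed across AutoAWQ versions, so treat this as illustrative rather than exact.

```python
# Sketch: loading a GEMV-quantized checkpoint for generation.
# from_quantized's exact signature varies across AutoAWQ versions.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_path = "casperhansen/vicuna-7b-v1.5-awq-gemv"

model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)

inputs = tokenizer("Hello, my name is", return_tensors="pt").input_ids.cuda()
output = model.generate(inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```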