
[NEW] GEMV kernel implementation #40

Merged 52 commits into main from new_kernel on Sep 13, 2023

Conversation

casper-hansen (Owner) commented Sep 8, 2023

Overall, this branch is about 60% faster than main thanks to the new GEMV and FasterTransformer kernels. When both GEMM and GEMV use the same FasterTransformer kernels, GEMV is about 20-25% faster than GEMM at token generation, but roughly 10x slower at processing context.

  • Two versions of the quantized linear module: WQLinear_GEMM and WQLinear_GEMV (see the sketch after this list)
  • Refactor qmodule into a modules directory.
  • Rather than maintaining backward compatibility with the GEMM kernel for only a few versions of AutoAWQ, keep both versions: GEMM processes context about 10x faster than GEMV, but is roughly 20% slower at token generation.
  • Implement a faster kernel to optimize inference speed
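
Since both module versions ship, the kernel is chosen when a model is quantized. Below is a minimal sketch of what that selection might look like with AutoAWQ's quantization API; the quant_config keys (notably "version") and the model/output paths are assumptions for illustration and may differ between releases.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "lmsys/vicuna-7b-v1.5"     # example base model
quant_path = "vicuna-7b-v1.5-awq-gemv"  # hypothetical output directory

# "version" selects the quantized linear module, and hence the kernel:
# "GEMM" favors prefill (context) throughput, "GEMV" favors decode speed.
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMV",  # or "GEMM"
}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```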

TODO:

  • Implement fused model for Falcon 7B
  • Implement fused model for Falcon 40B. Defer to a separate PR (maybe the MQA/GQA PR); a large change seems to be required. Inspiration.
  • Implement batch size support for QuantAttentionFused
  • Implement better allocation of max_seq_len to reduce VRAM usage, and write a user guide for setting max_seq_len correctly (512 tokens ≈ 3.89 GB VRAM; 8192 tokens ≈ 9 GB VRAM). See the sketch after this list.
  • Test building for Windows support (@qwopqwop200 would love some help)
  • Test whether the KV cache and start_pos need to be reset after generating a model's full context. Not handled for now; it can be worked around by setting max_new_tokens.
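
As noted in the max_seq_len item above, VRAM scales with the cache length pre-allocated for the fused modules. A minimal sketch of the max_new_tokens workaround mentioned in the last item, assuming the from_quantized signature from this era of AutoAWQ (argument names may have changed in later releases):

```python
from awq import AutoAWQForCausalLM

# max_new_tokens bounds the pre-allocated KV cache of the fused modules,
# so smaller values reduce VRAM (roughly 3.89 GB at 512 tokens vs ~9 GB
# at 8192, per the numbers in the TODO above).
model = AutoAWQForCausalLM.from_quantized(
    "casperhansen/vicuna-7b-v1.5-awq-gemv",
    fuse_layers=True,
    max_new_tokens=512,  # hypothetical value; size it to prompt + generation
)
```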

GEMV Benchmark

python examples/benchmark.py --model_path casperhansen/vicuna-7b-v1.5-awq-gemv

| Prefill length | Decode length | Prefill tokens/s | Decode tokens/s | Memory (VRAM) | GPU |
|---:|---:|---:|---:|:---|:---|
| 4 | 200 | 9253.43 | 123.198 | 6.70 GB (28.26%) | NVIDIA GeForce RTX 3090 |
| 32 | 32 | 227.125 | 125.736 | 6.70 GB (28.29%) | NVIDIA GeForce RTX 3090 |
| 64 | 64 | 226.292 | 126.47 | 6.70 GB (28.29%) | NVIDIA GeForce RTX 3090 |
| 128 | 128 | 227.386 | 124.9 | 6.70 GB (28.29%) | NVIDIA GeForce RTX 3090 |
| 256 | 256 | 220.668 | 125.361 | 6.70 GB (28.29%) | NVIDIA GeForce RTX 3090 |
| 512 | 512 | 211.274 | 123.781 | 6.70 GB (28.29%) | NVIDIA GeForce RTX 3090 |
| 1024 | 1024 | 206.607 | 109.317 | 7.16 GB (30.21%) | NVIDIA GeForce RTX 3090 |
| 2048 | 2048 | 209.165 | 108.626 | 9.14 GB (38.57%) | NVIDIA GeForce RTX 3090 |

GEMM Benchmark

python examples/benchmark.py --model_path casperhansen/vicuna-7b-v1.5-awq

| Prefill length | Decode length | Prefill tokens/s | Decode tokens/s | Memory (VRAM) | GPU |
|---:|---:|---:|---:|:---|:---|
| 4 | 200 | 8822.41 | 100.466 | 6.69 GB (28.26%) | NVIDIA GeForce RTX 3090 |
| 32 | 32 | 1452.43 | 103.222 | 6.70 GB (28.29%) | NVIDIA GeForce RTX 3090 |
| 64 | 64 | 2414.81 | 101.601 | 6.70 GB (28.29%) | NVIDIA GeForce RTX 3090 |
| 128 | 128 | 2719.56 | 101.336 | 6.70 GB (28.29%) | NVIDIA GeForce RTX 3090 |
| 256 | 256 | 2733.53 | 102.536 | 6.70 GB (28.29%) | NVIDIA GeForce RTX 3090 |
| 512 | 512 | 2702.88 | 102.012 | 6.70 GB (28.29%) | NVIDIA GeForce RTX 3090 |
| 1024 | 1024 | 2568.84 | 92.7161 | 7.16 GB (30.20%) | NVIDIA GeForce RTX 3090 |
| 2048 | 2048 | 2240.23 | 92.16 | 9.14 GB (38.57%) | NVIDIA GeForce RTX 3090 |

casper-hansen mentioned this pull request Sep 11, 2023
casper-hansen marked this pull request as ready for review September 11, 2023 09:19
casper-hansen (Owner, Author) commented:

This PR is ready. The last things needed are a speedup for the MPT and Falcon models and thorough testing of quantization across all models.

casper-hansen (Owner, Author) commented Sep 11, 2023

The speed test for MPT is now correct and matches TinyChat. However, generation is still random and does not produce correct output.

EDIT: Finally working!!

RTX 3090, MPT 7B

| Prefill length | Decode length | Prefill tokens/s | Decode tokens/s | Memory (VRAM) | GPU |
|---:|---:|---:|---:|:---|:---|
| 4 | 200 | 9636.87 | 131.871 | 7.67 GB (32.37%) | NVIDIA GeForce RTX 3090 |
| 32 | 32 | 235.403 | 134.894 | 7.67 GB (32.37%) | NVIDIA GeForce RTX 3090 |
| 64 | 64 | 231.559 | 132.019 | 7.67 GB (32.37%) | NVIDIA GeForce RTX 3090 |
| 128 | 128 | 232.226 | 130.809 | 7.67 GB (32.37%) | NVIDIA GeForce RTX 3090 |
| 256 | 256 | 221.689 | 127.613 | 7.67 GB (32.37%) | NVIDIA GeForce RTX 3090 |
| 512 | 512 | 223.46 | 126.43 | 7.67 GB (32.37%) | NVIDIA GeForce RTX 3090 |
| 1024 | 1024 | 224.193 | 109.799 | 7.93 GB (33.46%) | NVIDIA GeForce RTX 3090 |
| 2048 | 2048 | 221.074 | 107.975 | 8.94 GB (37.74%) | NVIDIA GeForce RTX 3090 |

casper-hansen mentioned this pull request Sep 12, 2023
casper-hansen merged commit 1b0af2d into main Sep 13, 2023
casper-hansen deleted the new_kernel branch September 13, 2023 14:19