
[NEW] GEMV kernel implementation #40

Merged 52 commits into main from new_kernel on Sep 13, 2023

Conversation

casper-hansen (Owner) commented Sep 8, 2023

Overall, this branch is about 60% faster than main thanks to the new GEMV and FasterTransformer kernels. When both GEMM and GEMV use the same FasterTransformer kernels, GEMV is about 20-25% faster than GEMM at token generation, but roughly 10x slower at processing context.

  • Two versions of the quantized linear module: WQLinear_GEMM and WQLinear_GEMV (see the sketch after this list)
  • Refactor qmodule into a modules directory.
  • Rather than maintaining backward compatibility with the GEMM kernel for only a few versions of AutoAWQ, keep both versions: GEMM processes context about 10x faster than GEMV, but is roughly 20% slower at token generation.
  • Implement a faster kernel to optimize inference speed
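
Since both module versions ship, the kernel is chosen when a model is quantized. Below is a minimal sketch of what that selection might look like with AutoAWQ's quantization API; the quant_config keys (notably "version") and the model/output paths are assumptions for illustration and may differ between releases.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "lmsys/vicuna-7b-v1.5"     # example base model
quant_path = "vicuna-7b-v1.5-awq-gemv"  # hypothetical output directory

# "version" selects the quantized linear module, and hence the kernel:
# "GEMM" favors prefill (context) throughput, "GEMV" favors decode speed.
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMV",  # or "GEMM"
}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```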

TODO:

  • Implement fused model for Falcon 7B
  • Implement fused model for Falcon 40B. Defer to a separate PR (maybe the MQA/GQA PR); a large change seems to be required. Inspiration.
  • Implement batch size support for QuantAttentionFused
  • Implement better allocation of max_seq_len to reduce VRAM usage, and write a user guide for setting max_seq_len correctly (512 tokens ≈ 3.89 GB VRAM; 8192 tokens ≈ 9 GB VRAM). See the sketch after this list.
  • Test building for Windows support (@qwopqwop200 would love some help)
  • Test whether the KV cache and start_pos need to be reset after generating a model's full context. Not handled for now; it can be worked around by setting max_new_tokens.
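
As noted in the max_seq_len item above, VRAM scales with the cache length pre-allocated for the fused modules. A minimal sketch of the max_new_tokens workaround mentioned in the last item, assuming the from_quantized signature from this era of AutoAWQ (argument names may have changed in later releases):

```python
from awq import AutoAWQForCausalLM

# max_new_tokens bounds the pre-allocated KV cache of the fused modules,
# so smaller values reduce VRAM (roughly 3.89 GB at 512 tokens vs ~9 GB
# at 8192, per the numbers in the TODO above).
model = AutoAWQForCausalLM.from_quantized(
    "casperhansen/vicuna-7b-v1.5-awq-gemv",
    fuse_layers=True,
    max_new_tokens=512,  # hypothetical value; size it to prompt + generation
)
```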

GEMV Benchmark

python examples/benchmark.py --model_path casperhansen/vicuna-7b-v1.5-awq-gemv

| Prefill length | Decode length | Prefill tokens/s | Decode tokens/s | Memory (VRAM) | GPU |
|---:|---:|---:|---:|:---|:---|
| 4 | 200 | 9253.43 | 123.198 | 6.70 GB (28.26%) | NVIDIA GeForce RTX 3090 |
| 32 | 32 | 227.125 | 125.736 | 6.70 GB (28.29%) | NVIDIA GeForce RTX 3090 |
| 64 | 64 | 226.292 | 126.47 | 6.70 GB (28.29%) | NVIDIA GeForce RTX 3090 |
| 128 | 128 | 227.386 | 124.9 | 6.70 GB (28.29%) | NVIDIA GeForce RTX 3090 |
| 256 | 256 | 220.668 | 125.361 | 6.70 GB (28.29%) | NVIDIA GeForce RTX 3090 |
| 512 | 512 | 211.274 | 123.781 | 6.70 GB (28.29%) | NVIDIA GeForce RTX 3090 |
| 1024 | 1024 | 206.607 | 109.317 | 7.16 GB (30.21%) | NVIDIA GeForce RTX 3090 |
| 2048 | 2048 | 209.165 | 108.626 | 9.14 GB (38.57%) | NVIDIA GeForce RTX 3090 |

GEMM Benchmark

python examples/benchmark.py --model_path casperhansen/vicuna-7b-v1.5-awq

| Prefill length | Decode length | Prefill tokens/s | Decode tokens/s | Memory (VRAM) | GPU |
|---:|---:|---:|---:|:---|:---|
| 4 | 200 | 8822.41 | 100.466 | 6.69 GB (28.26%) | NVIDIA GeForce RTX 3090 |
| 32 | 32 | 1452.43 | 103.222 | 6.70 GB (28.29%) | NVIDIA GeForce RTX 3090 |
| 64 | 64 | 2414.81 | 101.601 | 6.70 GB (28.29%) | NVIDIA GeForce RTX 3090 |
| 128 | 128 | 2719.56 | 101.336 | 6.70 GB (28.29%) | NVIDIA GeForce RTX 3090 |
| 256 | 256 | 2733.53 | 102.536 | 6.70 GB (28.29%) | NVIDIA GeForce RTX 3090 |
| 512 | 512 | 2702.88 | 102.012 | 6.70 GB (28.29%) | NVIDIA GeForce RTX 3090 |
| 1024 | 1024 | 2568.84 | 92.7161 | 7.16 GB (30.20%) | NVIDIA GeForce RTX 3090 |
| 2048 | 2048 | 2240.23 | 92.16 | 9.14 GB (38.57%) | NVIDIA GeForce RTX 3090 |

casper-hansen mentioned this pull request Sep 11, 2023
casper-hansen marked this pull request as ready for review September 11, 2023 09:19
casper-hansen (Owner, Author) commented:

This PR is ready. The last things needed are a speedup for the MPT and Falcon models and thorough testing of quantization across all models.

casper-hansen (Owner, Author) commented Sep 11, 2023

The speed test for MPT is now correct and matches TinyChat. However, generation is still random and does not produce correct output.

EDIT: Finally working!!

RTX 3090, MPT 7B

| Prefill length | Decode length | Prefill tokens/s | Decode tokens/s | Memory (VRAM) | GPU |
|---:|---:|---:|---:|:---|:---|
| 4 | 200 | 9636.87 | 131.871 | 7.67 GB (32.37%) | NVIDIA GeForce RTX 3090 |
| 32 | 32 | 235.403 | 134.894 | 7.67 GB (32.37%) | NVIDIA GeForce RTX 3090 |
| 64 | 64 | 231.559 | 132.019 | 7.67 GB (32.37%) | NVIDIA GeForce RTX 3090 |
| 128 | 128 | 232.226 | 130.809 | 7.67 GB (32.37%) | NVIDIA GeForce RTX 3090 |
| 256 | 256 | 221.689 | 127.613 | 7.67 GB (32.37%) | NVIDIA GeForce RTX 3090 |
| 512 | 512 | 223.46 | 126.43 | 7.67 GB (32.37%) | NVIDIA GeForce RTX 3090 |
| 1024 | 1024 | 224.193 | 109.799 | 7.93 GB (33.46%) | NVIDIA GeForce RTX 3090 |
| 2048 | 2048 | 221.074 | 107.975 | 8.94 GB (37.74%) | NVIDIA GeForce RTX 3090 |

casper-hansen mentioned this pull request Sep 12, 2023
casper-hansen merged commit 1b0af2d into main Sep 13, 2023
casper-hansen deleted the new_kernel branch September 13, 2023 14:19