Fused attention: Switch to Flash Decoding #656

casper-hansen · 2024-11-26T16:36:31Z

Current implementation

Device: cuda:0
GPU: NVIDIA GeForce RTX 4090
Model: hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4
Version: gemm
|   Batch Size |   Prefill Length |   Decode Length |   Prefill tokens/s |   Decode tokens/s | Memory (VRAM)    |
|-------------:|-----------------:|----------------:|-------------------:|------------------:|:-----------------|
|            1 |               32 |              32 |             213.47 |             97.07 | 5.48 GB (23.16%) |
|            1 |               64 |              64 |            3985.32 |             96.23 | 5.48 GB (23.20%) |
|            1 |              128 |             128 |            4977.39 |             94.95 | 5.50 GB (23.27%) |
|            1 |              256 |             256 |            5416.4  |             94.45 | 5.54 GB (23.42%) |
|            1 |              512 |             512 |            5403.73 |             93.73 | 5.64 GB (23.87%) |
|            1 |             1024 |            1024 |            7218.92 |             92.74 | 5.89 GB (24.90%) |
|            1 |             2048 |            2048 |            7684.16 |             83.76 | 6.43 GB (27.21%) |
|            1 |             4096 |            4096 |            7308.05 |             59.79 | 7.52 GB (31.82%) |

With `flash-attn`

Device: cuda:0
GPU: NVIDIA GeForce RTX 4090
Model: hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4
Version: gemm
|   Batch Size |   Prefill Length |   Decode Length |   Prefill tokens/s |   Decode tokens/s | Memory (VRAM)    |
|-------------:|-----------------:|----------------:|-------------------:|------------------:|:-----------------|
|            1 |               32 |              32 |             232.19 |            107.17 | 5.48 GB (23.16%) |
|            1 |               64 |              64 |            4030.56 |            106.3  | 5.48 GB (23.20%) |
|            1 |              128 |             128 |            5044.31 |            104.98 | 5.50 GB (23.27%) |
|            1 |              256 |             256 |            5290.12 |            104.99 | 5.54 GB (23.42%) |
|            1 |              512 |             512 |            5457.14 |            104.55 | 5.64 GB (23.87%) |
|            1 |             1024 |            1024 |            7465.82 |            104.06 | 5.88 GB (24.85%) |
|            1 |             2048 |            2048 |            8284.3  |            104.03 | 6.41 GB (27.12%) |
|            1 |             4096 |            4096 |            8487.37 |            103.77 | 7.48 GB (31.63%) |
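
To make the comparison easier to read, the decode numbers from the two tables can be turned into per-length speedups. The snippet below is only a convenience calculation over the figures reported above; the dictionary names and the helper `decode_speedup` are illustrative and not part of the PR.

```python
# Decode tokens/s copied from the two tables above, keyed by prefill/decode length.
# "baseline" is the current implementation, "flash" is the run with flash-attn.
baseline = {32: 97.07, 64: 96.23, 128: 94.95, 256: 94.45,
            512: 93.73, 1024: 92.74, 2048: 83.76, 4096: 59.79}
flash = {32: 107.17, 64: 106.3, 128: 104.98, 256: 104.99,
         512: 104.55, 1024: 104.06, 2048: 104.03, 4096: 103.77}


def decode_speedup(length):
    """Ratio of flash-attn decode throughput to the current implementation."""
    return flash[length] / baseline[length]


for length in sorted(baseline):
    print(f"{length:>5}: {decode_speedup(length):.2f}x")
```

On these numbers the decode speedup is about 1.10x at length 32 and about 1.74x at length 4096. The current implementation retains roughly 62% of its length-32 decode throughput at length 4096, while the flash-attn run retains roughly 97%.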

casper-hansen added 5 commits November 24, 2024 11:10

initial refactor

576078e

generation is coherent

79d9f03

flash_attn_with_kvcache for decoding

8915317

cleanup + formatting

0913bc2

add flash-attn to kernels extras

902779f
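
For readers unfamiliar with the idea behind the title, here is a minimal pure-Python sketch of split-KV attention for a single decode query, the approach usually called Flash Decoding. It is an illustration under assumptions, not the code from these commits and not the `flash-attn` kernel: the function names (`attend_chunk`, `split_kv_attention`, `reference_attention`) and the `chunk_size` parameter are hypothetical. Each KV chunk is processed independently, which is what lets a GPU kernel parallelize over a long cache, and the partial results are merged with a log-sum-exp style rescaling.

```python
import math


def attend_chunk(q, keys, values):
    """Attend one query to one KV chunk.

    Returns the unnormalized weighted sum of values, the chunk's max score,
    and the sum of exp(score - max) for that chunk.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    chunk_max = max(scores)
    weights = [math.exp(s - chunk_max) for s in scores]
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return out, chunk_max, sum(weights)


def split_kv_attention(q, keys, values, chunk_size):
    """Split the KV cache into chunks, attend to each, then merge the partials."""
    partials = [
        attend_chunk(q, keys[i:i + chunk_size], values[i:i + chunk_size])
        for i in range(0, len(keys), chunk_size)
    ]
    global_max = max(chunk_max for _, chunk_max, _ in partials)
    dim = len(values[0])
    total = [0.0] * dim
    denom = 0.0
    for out, chunk_max, chunk_sum in partials:
        # Rescale each chunk from its local max to the global max.
        correction = math.exp(chunk_max - global_max)
        denom += chunk_sum * correction
        for d in range(dim):
            total[d] += out[d] * correction
    return [t / denom for t in total]


def reference_attention(q, keys, values):
    """Plain softmax attention over the whole cache, for comparison."""
    out, _, denom = attend_chunk(q, keys, values)
    return [o / denom for o in out]
```

Both functions should agree up to floating-point error for any `chunk_size`, since the merge step only changes the order in which the softmax terms are accumulated.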

casper-hansen merged commit dfe396a into main Nov 26, 2024

casper-hansen mentioned this pull request Dec 13, 2024

[MAJOR] Update to TinyChat 2.0 mit-han-lab/llm-awq#244

Merged
