Prefix Caching with FP8 KV cache support #3234
Conversation
3116104 to 5d31fe7
We found that triton==2.1.0, which torch 2.1.2 depends on, does not support the fp8 kernel correctly. We can:
What do you think? @zhaoyang-star @WoosukKwon
I think the second option is simple and clear. @WoosukKwon What do you think about this?
@zhaoyang-star @chenxu2048 Can it wait a bit? We have some issues in upgrading the PyTorch version (#2804).
@WoosukKwon Sure. Do we have a schedule for the upgrade?
Use torch.fp8_e5m2 instead of torch.uint8 in python interface
LGTM
@WoosukKwon FYI, prefix caching is not working on Turing GPUs with …
@chenxu2048 Hello, we tested the 7B Qwen model with the fp8 modifications for prefix_prefill and found an accuracy loss compared to disabling prefix_prefill and using fp8 alone. Could this be because the fp8-to-bf16 conversion in Triton differs from the fp8_e5m2_unscaled::vec_conversion implementation?
Could you please run an evaluation with prefix_prefill enabled and fp8_e5m2 disabled? Just to verify whether the accuracy drop is caused by fp8_e5m2.
In the test with prefix_prefill enabled and fp8_e5m2 disabled, the results were normal. Only when both prefix_prefill and fp8_e5m2 are enabled is the accuracy affected.
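For reference, the two configurations being compared could be set up roughly as below. This is an illustrative sketch, not code from the PR, and assumes the vLLM engine arguments enable_prefix_caching and kv_cache_dtype available around this era of the codebase.

from vllm import LLM

# Prefix caching on, FP8 KV cache off: the baseline that behaves normally.
llm_baseline = LLM(model="Qwen/Qwen-7B-Chat",
                   trust_remote_code=True,
                   enable_prefix_caching=True)

# Prefix caching on, fp8_e5m2 KV cache on: the case showing an accuracy drop.
llm_fp8 = LLM(model="Qwen/Qwen-7B-Chat",
              trust_remote_code=True,
              enable_prefix_caching=True,
              kv_cache_dtype="fp8_e5m2")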
Hi, @snippetzero. Could you provide the model and some inputs for testing? Without prefix caching, the Key and Value computed in the pre-filling stage are in FP16, and FP8 is only used in the decoding stage. With prefix caching enabled, however, both the Key and Value in the KV cache are in FP8. I think additional precision loss might be introduced in pre-filling and in the prefix KV cache.
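To estimate how much of the drop can be explained by the extra fp8_e5m2 round trip on prefix keys/values, a minimal sketch like the following could measure the quantization error directly. It assumes torch >= 2.1, which exposes torch.float8_e5m2, and is not part of this PR.

import torch

def fp8_e5m2_roundtrip_error(x: torch.Tensor) -> float:
    # Cast down as if writing into the FP8 KV cache, then back up as the kernel does.
    x_back = x.to(torch.float8_e5m2).to(x.dtype)
    return ((x - x_back).abs() / x.abs().clamp_min(1e-6)).mean().item()

k = torch.randn(256, 128, dtype=torch.float16)
print(f"mean relative error of one fp8_e5m2 round trip: {fp8_e5m2_roundtrip_error(k):.3f}")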
FlashAttention 3 will support FP8 soon and FlashInfer already supports it, so I would like to bump this PR and hopefully put it on the roadmap again.
(start_n + offs_n[:, None]) % block_size * stride_v_cache_bl)
k = tl.load(K_cache + off_k,
            mask=(start_n + offs_n[None, :]) < cur_batch_ctx_len,
            other=0.0).to(q.dtype)
Are you sure you should just do this? What about the fp8 scaling factor?
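If a scaling factor were needed here, one option would be to dequantize right after the upcast. This is a hypothetical sketch, not this PR's approach (the PR uses unscaled fp8_e5m2 conversion), and k_scale is an assumed extra kernel argument.

k = tl.load(K_cache + off_k,
            mask=(start_n + offs_n[None, :]) < cur_batch_ctx_len,
            other=0.0).to(q.dtype) * k_scale  # k_scale: per-tensor dequant factor for K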
assert Lq == Lk and Lk == Lv
assert Lk in {16, 32, 64, 128}

sm_scale = 1.0 / (Lq**0.5)
What happened to the FP8 scaling factor?
FlashInfer supports it in v0.2.0, which is being released, so we will be unblocked soon.
Fix #3156
The FP8 KV cache uses tensors with dtype torch.uint8 and converts them from fp8_e5m2 to float16 inside the paged attention kernel. The prefix-cache Triton kernel cannot handle key_cache and value_cache with the "wrong" dtype. Therefore, we reinterpret them as fp8_e5m2 before the kernel so they can be upcast to the correct dtype inside the kernel.
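As a rough illustration of that dtype handling (a sketch under the assumption that torch >= 2.2 exposes torch.float8_e5m2; the variable names are made up and this is not the PR's actual code):

import torch

# The KV cache is allocated as raw bytes; the stored bits are already fp8_e5m2.
key_cache_u8 = torch.empty(1024, 16, 128, dtype=torch.uint8)

# Reinterpret the same storage as fp8_e5m2 without copying (both dtypes are 1 byte),
# so the Triton kernel sees the expected dtype and can upcast it internally,
# e.g. via tl.load(...).to(tl.float16).
key_cache_fp8 = key_cache_u8.view(torch.float8_e5m2)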
Code was tested with Qwen/Qwen-7B-Chat.
This PR requires triton 2.2.0 and torch 2.2.x, and it depends on #2804.