[PaddleInference] support ptq and cachekv_quant in BlockMultiHeadAttention op #59951

Merged: 10 commits into PaddlePaddle:develop on Dec 15, 2023

Conversation

@RichardWooSJTU (Contributor) commented Dec 12, 2023

PR types

Function optimization

PR changes

OPs

Description

For block attention (the BlockMultiHeadAttention op), this PR:

  1. supports dynamic cache-KV quantization
  2. supports static cache-KV quantization
  3. supports PTQ fusion

(A brief sketch of the cache-KV quantization idea follows the description.)

Pcard-71502
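As a rough illustration of items 1–3, the numpy sketch below shows the generic cache-KV quantization idea: dynamic quantization derives the int8 scale from the current key/value cache at runtime, static (PTQ) quantization reuses a scale calibrated offline, and the cache is dequantized with a matching scale when attention consumes it. The function names and the scale convention are illustrative assumptions, not the kernel code added in this PR; the op's actual scale inputs are cache_k/v_quant_scales and cache_k/v_dequant_scales.

```python
import numpy as np

# Illustrative convention only: scale = absmax / 127, quantize by dividing,
# dequantize by multiplying. The real op may define its scales differently
# (e.g. as reciprocals), so treat this purely as a sketch of the idea.

def dynamic_quant(cache, num_bits=8):
    # Dynamic cache-KV quant: derive the scale from the tensor at runtime.
    qmax = 2 ** (num_bits - 1) - 1          # 127 for int8
    scale = np.abs(cache).max() / qmax
    q = np.clip(np.round(cache / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def static_quant(cache, scale, num_bits=8):
    # Static (PTQ) cache-KV quant: the scale was calibrated offline and is
    # passed in, analogous to the cache_k/v_quant_scales inputs.
    qmax = 2 ** (num_bits - 1) - 1
    return np.clip(np.round(cache / scale), -qmax - 1, qmax).astype(np.int8)

def dequant(q_cache, scale):
    # Dequantize the int8 cache back to float when attention reads it,
    # analogous to the cache_k/v_dequant_scales inputs.
    return q_cache.astype(np.float32) * scale

# Toy usage: quantize a key cache dynamically, then recover an approximation.
k_cache = np.random.randn(2, 8, 64).astype(np.float32)
q_k, k_scale = dynamic_quant(k_cache)
k_approx = dequant(q_k, k_scale)
```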

paddle-bot bot commented Dec 12, 2023

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@RichardWooSJTU changed the title from "[PaddleInference] support cachekv_quant in BlockMultiHeadAttention op" to "[PaddleInference] support ptq and cachekv_quant in BlockMultiHeadAttention op" on Dec 13, 2023
"be less than [%d] and greater than or equal to 0, but received [%d]",
vocab_size,
id);
// PADDLE_ENFORCE(
Contributor

Why was this commented out?

Contributor Author

Multi-turn runs currently hit an error here; the error is not as expected, and we are still debugging it with the original author.

"be less than [%d] and greater than or equal to 0, but received [%d]",
vocab_size,
id);
// PADDLE_ENFORCE(
Contributor

This check is still needed.

Contributor Author

Multi-turn runs currently hit an error here; the error is not as expected, and we are still debugging it with the original author.

@vivienfanghuagood (Contributor) left a comment:
LGTM for API change

Comment on lines +33 to +40
cache_k_quant_scales=None,
cache_v_quant_scales=None,
cache_k_dequant_scales=None,
cache_v_dequant_scales=None,
qkv_out_scale=None,
qkv_bias=None,
out_shift=None,
out_smooth=None,
Contributor

The newly added parameters, such as cache_k_quant_scales, need to be described in the Args section of the docstring below.

Contributor

Normally, newly added parameters should be appended at the end of the signature to preserve backward compatibility (a short sketch of this point follows this thread).

Contributor Author

This will be added in the next PR.
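To make the compatibility concern above concrete, here is a hypothetical, simplified signature (block_mha_old / block_mha_new are made-up names, not the real block_multihead_attention interface): appending new keyword parameters with defaults keeps existing positional call sites valid, whereas inserting them earlier would silently shift positional arguments.

```python
# Hypothetical, simplified signatures for illustration only.

def block_mha_old(qkv, key_cache, value_cache, max_seq_len):
    return "attention output"

# New optional parameters appended at the end: an old positional call such as
# block_mha_new(qkv, kc, vc, 1024) keeps exactly the same meaning.
def block_mha_new(qkv, key_cache, value_cache, max_seq_len,
                  cache_k_quant_scales=None, cache_v_quant_scales=None):
    return "attention output"

out = block_mha_new("qkv", "k_cache", "v_cache", 1024)  # still valid

# Had the new parameters been inserted before max_seq_len instead, this call
# would silently bind 1024 to a quant-scale argument and break the caller.
```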

@Ligoml (Contributor) left a comment:

LGTM for docs (the Chinese documentation still needs to be added).

@heavengate merged commit 036a314 into PaddlePaddle:develop on Dec 15, 2023
28 of 29 checks passed
RichardWooSJTU added a commit to RichardWooSJTU/Paddle that referenced this pull request Dec 15, 2023
…ntion op (PaddlePaddle#59951)

* support cachekv_quant in blha

---------

Co-authored-by: Wanglongzhi2001 <[email protected]>
RichardWooSJTU added a commit to RichardWooSJTU/Paddle that referenced this pull request Dec 15, 2023
…ntion op (PaddlePaddle#59951)

* support cachekv_quant in blha

---------

Co-authored-by: Wanglongzhi2001 <[email protected]>
raindrops2sea pushed a commit that referenced this pull request Dec 18, 2023
…ntion op (#59951) (#60073)

* support cachekv_quant in blha

---------

Co-authored-by: Wanglongzhi2001 <[email protected]>