[CPU]PageAttn with 4bit-quantization #27992

zhangYiIntel · 2024-12-10T08:23:41Z

Details:

Add a new hint to set group_size for the key/value cache
Add grouped 4-bit symmetric/asymmetric quantization support for PageAttentionNode
Add grouped U8 quantization support for PageAttentionNode
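
For readers unfamiliar with grouped quantization, the following is a minimal reference sketch of asymmetric 4-bit (u4) quantization with one scale and zero point per group. It is not the kernel from attn_quant_kernel.hpp: the names `GroupedU4`, `quantize_u4_grouped` and `dequantize_u4_grouped`, the low-nibble-first packing, and the storage layout are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container (not an OpenVINO type): packed 4-bit codes plus
// one scale and one zero point per group of `group_size` values.
struct GroupedU4 {
    std::vector<uint8_t> packed;    // two codes per byte, low nibble first (assumed layout)
    std::vector<float> scale;       // one entry per group
    std::vector<float> zero_point;  // one entry per group
    size_t group_size = 0;
    size_t count = 0;               // number of original values
};

// Asymmetric quantization: each group maps its own [min, max] onto codes 0..15.
inline GroupedU4 quantize_u4_grouped(const std::vector<float>& src, size_t group_size) {
    assert(group_size > 0 && src.size() % group_size == 0);
    GroupedU4 out;
    out.group_size = group_size;
    out.count = src.size();
    out.packed.assign((src.size() + 1) / 2, 0);
    for (size_t begin = 0; begin < src.size(); begin += group_size) {
        const auto first = src.begin() + begin;
        const auto last = first + group_size;
        const float lo = *std::min_element(first, last);
        const float hi = *std::max_element(first, last);
        float scale = (hi - lo) / 15.0f;
        if (scale == 0.0f) scale = 1.0f;  // constant group: avoid division by zero
        const float zp = -lo / scale;
        out.scale.push_back(scale);
        out.zero_point.push_back(zp);
        for (size_t i = begin; i < begin + group_size; ++i) {
            float q = std::round(src[i] / scale + zp);
            q = std::min(15.0f, std::max(0.0f, q));  // clamp to the 4-bit range
            const uint8_t code = static_cast<uint8_t>(q);
            out.packed[i / 2] |= (i % 2 == 0) ? code : static_cast<uint8_t>(code << 4);
        }
    }
    return out;
}

// Inverse mapping: x is approximately (code - zero_point) * scale for the value's group.
inline std::vector<float> dequantize_u4_grouped(const GroupedU4& q) {
    std::vector<float> dst(q.count);
    for (size_t i = 0; i < q.count; ++i) {
        const uint8_t byte = q.packed[i / 2];
        const uint8_t code = (i % 2 == 0) ? (byte & 0x0F) : (byte >> 4);
        const size_t g = i / q.group_size;
        dst[i] = (static_cast<float>(code) - q.zero_point[g]) * q.scale[g];
    }
    return dst;
}
```

In this sketch a smaller group size gives each group a tighter value range and therefore a smaller rounding error, at the cost of storing more scales and zero points. A symmetric (s4) variant would drop the zero point and map values onto a signed code range instead.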

Tickets:

CVS-151586

Signed-off-by: [email protected] <[email protected]>

Signed-off-by: Zhang Yi3 <[email protected]>

src/plugins/intel_cpu/src/nodes/paged_attn.cpp

src/plugins/intel_cpu/src/nodes/scaled_attn.cpp

src/plugins/intel_cpu/src/nodes/kernels/scaled_attn/executor_pa.cpp

src/plugins/intel_cpu/src/nodes/kernels/scaled_attn/attn_quant_kernel.hpp

Signed-off-by: Zhang Yi3 <[email protected]>

zhangYiIntel added 13 commits December 9, 2024 16:03

[CPU]separate precisions of kv cache

15fcdb8

Signed-off-by: [email protected] <[email protected]>

[CPU]use element as template args

82f843a

[CPU]make quantize grouped

a754404

[CPU]make u8 kernel grouped

2aba224

[CPU]U4 Group size support with reference

fc435f6

Signed-off-by: [email protected] <[email protected]>

[CPU]AVX512 support for u4 kernel

d080e2a

[CPU]Support S4 quantization

78ef4dd

Signed-off-by: [email protected] <[email protected]>

[CPU]use AVX512 to quant s4

3e821ea

[CPU]4-bit quantization with avx2

80b093f

Signed-off-by: [email protected] <[email protected]>

fix build on older compilers

13a496e

[CPU]fix fp32 inference

92e6cb3

[CPU]set group size via hint

91ebc09

Signed-off-by: Zhang Yi3 <[email protected]>

[CPU]fix code style

685f263

Signed-off-by: Zhang Yi3 <[email protected]>

github-actions bot added labels category: inference (OpenVINO Runtime library - Inference), category: CPU (OpenVINO CPU plugin), category: Python API (OpenVINO Python bindings), category: CPP API (OpenVINO CPP API bindings) Dec 10, 2024

zhangYiIntel changed the title ~~Yi3/4bit cache~~ [CPU]PageAttn with 4bit-quantization Dec 10, 2024

[CPU]fix property test

e56639a

Signed-off-by: Zhang Yi3 <[email protected]>

zhangYiIntel force-pushed the yi3/4bit-cache branch from a12c86f to e56639a Compare December 11, 2024 02:45

[CPU]add cache precision check

a34ce8b

Signed-off-by: Zhang Yi3 <[email protected]>

zhangYiIntel mentioned this pull request Dec 12, 2024

[CB]Support 4-bit cache openvinotoolkit/openvino.genai#1366

Draft

zhangYiIntel marked this pull request as ready for review December 12, 2024 01:57

zhangYiIntel requested review from a team as code owners December 12, 2024 01:57

zhangYiIntel added 2 commits December 12, 2024 09:57

Merge branch 'master' into yi3/4bit-cache

8548773

[CPU]fix code style of config.cpp

fe6c311

Signed-off-by: Zhang Yi3 <[email protected]>

zhangYiIntel force-pushed the yi3/4bit-cache branch from 373d50d to fe6c311 Compare December 12, 2024 03:17

zhangYiIntel force-pushed the yi3/4bit-cache branch from 5bc75f8 to 76380d1 Compare December 12, 2024 08:35

Merge branch 'master' into yi3/4bit-cache

522215a

Signed-off-by: Zhang Yi3 <[email protected]>

zhangYiIntel force-pushed the yi3/4bit-cache branch from 76380d1 to 522215a Compare December 13, 2024 00:17

yuxu42 requested a review from luo-cheng2021 December 13, 2024 01:40

luo-cheng2021 reviewed Dec 13, 2024

zhangYiIntel added 3 commits December 17, 2024 16:51

[CPU]pre calculate count

8faadd8

[CPU]Use ov::element as template args

b4b0f0d

Signed-off-by: Zhang Yi3 <[email protected]>

[CPU]remove redundant macro

5c838f7

zhangYiIntel requested a review from luo-cheng2021 December 18, 2024 08:16
