[CPU] remove extra convert for fp16 #26755

xczhai · 2024-09-24T10:10:23Z

Details:

To remove the extra Convert nodes introduced by the ConvertPrecision pass, this PR optimizes several specific patterns. The passes are:

- remove extra convert to meet sdpa fusion
- fuse rms with following convert
- fuse fc with following convert
- fuse llmmlp with following convert

After these changes, sdpa/rms/fc/llmmlp absorb the Convert node that follows them, so the fusion patterns match again and performance improves.

| index | optimization pathway | 1st token (ms) | 2nd token (ms) | convert num | convert cost / 1st token | convert cost / 2nd token |
|---|---|---|---|---|---|---|
| -1 | target bf16 perf | 699.14 | 70.63 | 9 | 0.136 | 0.066 |
| 0 | before enabling f16 for sdpa | 1369.85 | 78.05 | 147 | 109.484 | 1.650 |
| 1 | enable f16 for sdpa | 1035.92 | 76.44 | 147 | 111.867 | 1.632 |
| 2 | fix sdpa fusion matching in f16 | 801.38 | 72.33 | 146 | 120.303 | 1.679 |
| 3 | fuse RMS with convert in f16 | 848.72 | 72.38 | 81 | 117.415 | 1.136 |
| 4 | fuse FC with convert in f16 | 780.73 | 71.69 | 48 | 53.783 | 0.700 |
| 5 | fuse LLMMLP with convert in f16 | 716.79 | 72.01 | 16 | 2.569 | 0.182 |

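With all passes enabled, fp16 performance is very close to bf16: 716.79 ms vs 699.14 ms for the 1st token (about 2.5% slower) and 72.01 ms vs 70.63 ms for the 2nd token (about 2% slower).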

Tickets:

152405

github-actions · 2024-11-16T00:23:44Z

This PR will be closed in a week because of 2 weeks of no activity.

liubo-intel · 2024-11-21T05:41:44Z

src/plugins/intel_cpu/src/transformations/transformation_pipeline.cpp

-                                 false);
+                                 need_convert_input_output_precision,
+                                 save_original_precision_attribute);


Hi @xczhai, this is not a review comment, just a question: from the description of the 'save_original_precision_attribute' param, it seems to be used to save the original precision of the node. But I didn't find any change to how such original precisions are processed in this PR, so which part of the logic is affected by this change?

Same question here: where is it actually being used, and what logic does it affect?

Also, the original name store_original_precision_as_rt_attribute better explains the idea behind this flag.

@liubo-intel @EgorDuplensky
To answer your question:

store_original_precision_as_rt_attribute takes effect at https://github.com/openvinotoolkit/openvino/blob/master/src/common/transformations/src/transformations/convert_precision.cpp#L426. At that point the pass inserts a Convert node after ReadValue, which is not necessary under the f16 hint. Because of this extra Convert, the subsequent sdpa fusion pass cannot match.

Both CPU and GPU call ConvertPrecision to implement the f16 conversion. In most cases, the output graphs should be similar.

openvino/src/plugins/intel_gpu/src/plugin/transformations_pipeline.cpp

Line 394 in 1b3550e

manager.register_pass<ov::pass::ConvertPrecision>(fp_convert_precision_map,

This change is aligned with GPU. I think GPU is a good f16 reference since it adopted f16 earlier than CPU.

usstq · 2024-11-22T00:46:33Z

src/plugins/intel_cpu/src/nodes/kernels/x64/mlp_kernel.cpp

-            if (m_to_f16) {
-                vcvtps2ph(ptr[dst + loop_i * 2], zmm0, 0x4);
-                vcvtps2ph(ptr[dst + loop_i * 2 + 32], zmm2, 0x4);
+            if (m_out_f32 && m_to_f16) {


it seems that m_out_f32 should override m_to_f16 flag? I mean, once a convert to fp32 is fused, it should always store f32 result, right? so if (m_out_f32) should be enough, right?

@usstq Unified this logic: replaced m_to_f16 with m_output_type to mark the possible output precision.
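The idea of replacing two booleans with one output-type field can be sketched as follows. This is a minimal illustration, not the PR's code: `OutputType`, `StorePath`, `select_store_path` and `bytes_per_element` are hypothetical names, and the assumption that the non-f16 16-bit path is bf16 is mine. With a single enum, contradictory combinations such as "store f32" and "convert to f16" at the same time cannot be expressed.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a kernel's output precision (not the PR's type).
enum class OutputType { f32, f16, bf16 };

// Hypothetical label for the store sequence the kernel would emit.
enum class StorePath { PlainF32, ConvertToF16, ConvertToBF16 };

// One field selects exactly one store path, so no flag can override another.
inline StorePath select_store_path(OutputType out) {
    switch (out) {
    case OutputType::f32:
        return StorePath::PlainF32;
    case OutputType::f16:
        return StorePath::ConvertToF16;
    case OutputType::bf16:
        return StorePath::ConvertToBF16;
    }
    assert(false && "unknown output type");
    return StorePath::PlainF32;
}

// Bytes written per output element; this drives the destination stride.
inline std::size_t bytes_per_element(OutputType out) {
    return out == OutputType::f32 ? 4 : 2;
}
```

A JIT kernel would branch on the enum once when generating code, and the destination pointer arithmetic would follow from `bytes_per_element`.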

usstq · 2024-11-22T00:51:49Z

src/plugins/intel_cpu/src/nodes/kernels/x64/mlp_kernel.cpp

-            if (m_to_f16) {
-                vcvtps2ph(ptr[dst + loop_i * 2], zmm0, 0x4);
-                vcvtps2ph(ptr[dst + loop_i * 2 + 32], zmm2, 0x4);
+            if (m_out_f32 && m_to_f16) {


same as above

@usstq Updated as above.

usstq · 2024-11-22T00:59:10Z

src/plugins/intel_cpu/src/transformations/cpu_opset/x64/pass/mlp_fuse_convert.cpp

+        if (!mlp_node) {
+            return false;
+        }
+


maybe add a check here to make sure convert is the only child of mlp node.

@usstq Added a has_only_child check.
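The reason for the check: fusing the Convert changes the producer's output type, so any other consumer that still expects the original type would receive the wrong precision. A minimal sketch of the guard on a toy graph, assuming hypothetical `ToyNode` and `convert_is_only_child` names (this does not use the OpenVINO API):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy node: an op type plus non-owning pointers to the nodes that read its output.
struct ToyNode {
    std::string type;
    std::vector<const ToyNode*> consumers;
};

// True only when `producer` has exactly one consumer and that consumer is a Convert.
// With more than one consumer, changing the producer's output type is unsafe.
inline bool convert_is_only_child(const ToyNode& producer) {
    return producer.consumers.size() == 1 &&
           producer.consumers.front()->type == "Convert";
}
```

A fusion callback would return false (leave the graph unchanged) whenever this predicate fails.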

usstq · 2024-11-22T01:00:45Z

src/plugins/intel_cpu/src/transformations/cpu_opset/common/pass/fc_convert_fusion.cpp

+        const auto& m_fc = pattern_map.at(fc).get_node_shared_ptr();
+        const auto& m_convert = pattern_map.at(convert).get_node_shared_ptr();
+        auto output_type = m_convert->get_output_element_type(0);
+


maybe add a check here to make sure convert is the only child of fc node.

@usstq Updated as above.

usstq · 2024-11-22T01:06:05Z

src/plugins/intel_cpu/src/transformations/cpu_opset/x64/op/llm_mlp.hpp

@@ -25,6 +25,7 @@ class LLMMLPNode : public ov::op::Op {
        int hidden_size;
        int up_size;
        bool gate_up_combined;
+        bool tail_f32 = false;


maybe we can add ov::element::Type output_type = ov::element::undefined; instead, to be consistent with FullyConnectedNode

@usstq I see. Updated the spec.

dmitry-gorokhov · 2024-11-25T11:51:16Z

@xczhai @usstq I have a couple of concerns regarding the current design:

- Why do we need separate Convert fusion transformations based on the target operation type? It sounds like such a fusion should be a universal transformation (same as ConvertPrecision), and the plugin should only specify the target operation types and ports where Convert should be fused.
- Fusion on the ngraph model-level representation doesn't cover the "post-ops" case. The common behavior of the ConvertPrecision transformation is to keep some activation functions in fp32 to preserve accuracy. That basically means the following pattern after the ConvertPrecision pass: [fp16] -> FC [fp32] -> Activation [fp32] -> Convert [fp16]. So in order to fuse such a Convert into FC we need to apply graph_optimizer passes first.

I expect that some day we will fully rewrite all graph_optimizer passes on ngraph rails, but for now we have to deal with that. So my proposal is to have a universal pass that allows fusing a Convert operation into other operations (and to replace the newly implemented passes with it).

What are your thoughts on that?
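To make the "universal pass" proposal concrete, here is a rough sketch on a toy IR, not the OpenVINO API. `ToyOp`, `ToyGraph`, `consumer_count` and `fuse_converts` are hypothetical names; the sketch takes only a set of target op types (ports are omitted) and does not address the post-ops case raised above.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Toy IR: every op has at most one input, referenced by index (-1 = graph input).
struct ToyOp {
    std::string type;
    std::string out_precision;
    int input;
    bool removed;
};

using ToyGraph = std::vector<ToyOp>;

// Number of live ops that read the output of op `idx`.
inline std::size_t consumer_count(const ToyGraph& g, int idx) {
    std::size_t n = 0;
    for (const auto& op : g) {
        if (!op.removed && op.input == idx) ++n;
    }
    return n;
}

// Folds a Convert into its producer when the producer's type is listed in
// `fusable_types` and the Convert is the producer's only consumer.
// Returns the number of Convert ops removed.
inline std::size_t fuse_converts(ToyGraph& g, const std::set<std::string>& fusable_types) {
    std::size_t fused = 0;
    for (int i = 0; i < static_cast<int>(g.size()); ++i) {
        if (g[i].removed || g[i].type != "Convert" || g[i].input < 0) continue;
        const int p = g[i].input;
        if (g[p].removed || fusable_types.count(g[p].type) == 0) continue;
        if (consumer_count(g, p) != 1) continue;
        // The producer now emits the Convert's output precision directly.
        g[p].out_precision = g[i].out_precision;
        // Rewire the Convert's consumers to read from the producer.
        for (auto& op : g) {
            if (op.input == i) op.input = p;
        }
        g[i].removed = true;
        ++fused;
    }
    return fused;
}
```

In this shape a plugin would pass something like `{"FC", "LLMMLP", "RMS"}` as `fusable_types` instead of registering one pass per operation type.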

xczhai · 2024-11-25T13:32:12Z

@dmitry-gorokhov
I understand your concern about this implementation. Let me explain some context from the development.

- For LLM models, the Convert nodes are mainly inserted after FC and LLMMLP, and they hurt performance heavily. I found an opportunity to fuse Convert into FC or LLMMLP. Because I optimized the f16 performance step by step, I ended up with two similar fusion passes. You mean these two fusion passes can be unified into one. Please correct me if I misunderstood.
- I have a different view on "So in order to fuse such Convert into FC we need to apply graph_optimizer passes first." During the optimization above, I referred to the GPU implementation. The GPU team implements a similar FC + Convert fusion pass:

openvino/src/plugins/intel_gpu/src/plugin/transformations_pipeline.cpp

Line 937 in 8ec51b2

manager.register_pass<ov::intel_gpu::FullyConnectedConvertFusion>();

Such a fusion pass looks simple to maintain.
- Last but not least, we fuse FC / LLMMLP with Convert(f16=>f32) rather than Convert(f32=>f16), so we can save much of the time spent on these inserted Convert(f16=>f32) nodes.

github-actions · 2024-12-24T00:23:23Z

This PR will be closed in a week because of 2 weeks of no activity.

xczhai requested review from a team as code owners September 24, 2024 10:10

xczhai marked this pull request as draft September 24, 2024 10:10

github-actions bot added the category: CPU OpenVINO CPU plugin label Sep 24, 2024

xczhai force-pushed the xc/fix_sdpa_fusion_for_f16 branch from 2643652 to cd899d7 Compare September 24, 2024 10:16

remove extra or redundant Convert in FP16

36578f0

- remove Convert for ReadValue node

xczhai force-pushed the xc/fix_sdpa_fusion_for_f16 branch from cd899d7 to 36578f0 Compare October 31, 2024 03:28

xczhai marked this pull request as ready for review October 31, 2024 09:09

Merge branch 'master' into xc/fix_sdpa_fusion_for_f16

41eccac

xczhai requested a review from liubo-intel October 31, 2024 09:10

github-actions bot added the Stale label Nov 16, 2024

xczhai added 8 commits November 19, 2024 22:35

remove extra or redundant Convert in FP16

b94ac6d

- remove Convert for ReadValue node

debug

7796c19

debug2

b0caf24

remove rms/fc convert

1720dc9

fuse mlp and convert

fd99fce

fix the mlp accuracy

c13dfe1

Merge branch 'openvinotoolkit:master' into xc/fix_sdpa_fusion_for_f16

0455e56

Merge pull request #18 from xczhai/xc/debug_f16_convert

ca8f26b

Xc/debug f16 convert

xczhai requested review from a team as code owners November 20, 2024 05:58

xczhai requested review from itikhono and removed request for a team November 20, 2024 05:58

github-actions bot added category: GPU OpenVINO GPU plugin category: transformations OpenVINO Runtime library - Transformations and removed category: GPU OpenVINO GPU plugin category: transformations OpenVINO Runtime library - Transformations labels Nov 20, 2024

xczhai force-pushed the xc/fix_sdpa_fusion_for_f16 branch from 0dfa834 to 3491756 Compare November 21, 2024 01:59

xczhai force-pushed the xc/fix_sdpa_fusion_for_f16 branch 2 times, most recently from c7601f3 to 6d362a0 Compare November 21, 2024 02:56

clear logs

5901ec9

xczhai force-pushed the xc/fix_sdpa_fusion_for_f16 branch from 6d362a0 to 5901ec9 Compare November 21, 2024 05:22

liubo-intel reviewed Nov 21, 2024

yuxu42 requested a review from usstq November 21, 2024 07:03

github-actions bot removed the Stale label Nov 22, 2024

usstq reviewed Nov 22, 2024

xczhai added 2 commits November 21, 2024 21:40

fix ci test

85f3d9e

check only one child for mlp/fc fuse

4b4d8eb

xczhai force-pushed the xc/fix_sdpa_fusion_for_f16 branch 2 times, most recently from db91b52 to 050df44 Compare November 25, 2024 09:03

refactor mlp op spec; refactor ReduceAdd2bh kernel

688d26b

xczhai force-pushed the xc/fix_sdpa_fusion_for_f16 branch from 050df44 to 688d26b Compare November 25, 2024 09:36

xczhai added 2 commits November 25, 2024 04:52

fix a arm error

aa4c1bb

fix a ci warning

10b2492

xczhai requested review from liubo-intel, EgorDuplensky and usstq November 25, 2024 10:13

xczhai added 2 commits November 25, 2024 22:16

Merge branch 'master' into xc/fix_sdpa_fusion_for_f16

5e33466

Merge branch 'master' into xc/fix_sdpa_fusion_for_f16

0dba6ff

wenjiew added this to the 2025.0 milestone Dec 9, 2024

github-actions bot added the Stale label Dec 24, 2024

yuxu42 added no_stale Do not mark as stale and removed Stale labels Dec 24, 2024

xczhai force-pushed the xc/fix_sdpa_fusion_for_f16 branch from 9c02940 to 0dba6ff Compare January 6, 2025 08:00
