[CPU] remove extra convert for fp16 #26755
base: master
Force-pushed: 2643652 to cd899d7
- remove Convert for ReadValue node
Force-pushed: cd899d7 to 36578f0
This PR will be closed in a week because of 2 weeks of no activity.
- remove Convert for ReadValue node
Xc/debug f16 convert
Force-pushed: 0dfa834 to 3491756
Force-pushed: c7601f3 to 6d362a0
Force-pushed: 6d362a0 to 5901ec9
        false);
        need_convert_input_output_precision,
        save_original_precision_attribute);
Hi @xczhai, not a review comment, just a question: from the description of the `save_original_precision_attribute` param, it seems to be used to save the original precision of the node. But I didn't find any change to how such original precisions are processed in this PR, so which part of the logic is affected by this change?
Same question here: where is it actually being used, and what logic does it affect?
Also, the original name `store_original_precision_as_rt_attribute` better explains the idea behind this flag.
@liubo-intel @EgorDuplensky To answer your question:
- `store_original_precision_as_rt_attribute` takes effect at https://github.com/openvinotoolkit/openvino/blob/master/src/common/transformations/src/transformations/convert_precision.cpp#L426. It means that the pass will insert a `Convert` node after `ReadValue`. But that is not necessary under the f16 hint. As a result, the following pass, `sdpafusion`, cannot match because of the extra `Convert`.
- Both CPU and GPU call `ConvertPrecision` to implement the `f16` conversion. In most cases, the output graphs should be similar. `manager.register_pass<ov::pass::ConvertPrecision>(fp_convert_precision_map,` (GPU). I think it is a good f16 reference, since GPU started using f16 earlier than CPU.
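To illustrate why one extra `Convert` is enough to defeat a fusion pattern, here is a minimal, self-contained sketch. It is a toy model and does not use OpenVINO's real pattern matcher: the graph is simplified to a linear chain of op-type names, and `kPattern` is a hypothetical pattern, not the actual one used by the SDPA fusion pass.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy model (assumption): a graph is a linear chain of op type names.
using Chain = std::vector<std::string>;

// Hypothetical pattern a fusion pass might look for.
const Chain kPattern = {"ReadValue", "Concat", "SDPA"};

// Returns true if `pattern` occurs as a contiguous run inside `graph`.
bool matches(const Chain& graph, const Chain& pattern) {
    assert(!pattern.empty());  // precondition of this sketch
    if (pattern.size() > graph.size()) return false;
    for (std::size_t i = 0; i + pattern.size() <= graph.size(); ++i) {
        bool ok = true;
        for (std::size_t j = 0; j < pattern.size(); ++j) {
            if (graph[i + j] != pattern[j]) { ok = false; break; }
        }
        if (ok) return true;
    }
    return false;
}
```

With this model, a chain `ReadValue, Concat, SDPA` matches, while `ReadValue, Convert, Concat, SDPA` does not, which mirrors the situation described in the comment above.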
        if (m_to_f16) {
            vcvtps2ph(ptr[dst + loop_i * 2], zmm0, 0x4);
            vcvtps2ph(ptr[dst + loop_i * 2 + 32], zmm2, 0x4);
        if (m_out_f32 && m_to_f16) {
It seems that `m_out_f32` should override the `m_to_f16` flag? I mean, once a convert to fp32 is fused, it should always store the f32 result, right? So `if (m_out_f32)` should be enough, right?
> It seems that `m_out_f32` should override the `m_to_f16` flag? I mean, once a convert to fp32 is fused, it should always store the f32 result, right? So `if (m_out_f32)` should be enough, right?

@usstq Unified this logic: replaced `m_to_f16` with `m_output_type` to mark the possible output precision.
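The idea of collapsing the two flags into one output-type field can be sketched as follows. This is only an illustration under stated assumptions: `OutType` is a hypothetical stand-in for the real element type, the non-f16 path is assumed to be bf16, and `from_flags` encodes the override rule suggested in the review (f32 wins once a convert to f32 is fused).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the kernel's output precision tag.
enum class OutType { f32, f16, bf16 };

// Old style: two booleans, where out_f32 overrides to_f16.
// Assumption: when neither flag is set, the kernel stores bf16.
constexpr OutType from_flags(bool out_f32, bool to_f16) {
    return out_f32 ? OutType::f32 : (to_f16 ? OutType::f16 : OutType::bf16);
}

// New style: a single tag decides the store path, e.g. the element size.
inline std::size_t bytes_per_element(OutType t) {
    switch (t) {
        case OutType::f32: return 4;
        case OutType::f16:
        case OutType::bf16: return 2;
    }
    assert(false && "unhandled OutType");
    return 0;
}
```

With a single tag, the store code branches once on the output type, so a combination such as "fused f32 output plus f16 flag" can no longer select the wrong store instruction.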
        if (m_to_f16) {
            vcvtps2ph(ptr[dst + loop_i * 2], zmm0, 0x4);
            vcvtps2ph(ptr[dst + loop_i * 2 + 32], zmm2, 0x4);
        if (m_out_f32 && m_to_f16) {
same as above
> same as above

@usstq Updated as above.
        if (!mlp_node) {
            return false;
        }
maybe add a check here to make sure convert is the only child of mlp node.
> maybe add a check here to make sure convert is the only child of mlp node.

@usstq Added a `has_only_child` check.
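A single-consumer check of the kind requested here can be sketched on a toy graph. The `Node` struct and `has_single_consumer_of_type` below are hypothetical and are not the PR's `has_only_child` helper or any OpenVINO API; they only show the condition being tested.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy node (assumption): a type name plus raw pointers to its consumers.
struct Node {
    std::string type;
    std::vector<Node*> consumers;
};

// True only if `producer` feeds exactly one node and that node has `type`.
bool has_single_consumer_of_type(const Node& producer, const std::string& type) {
    if (producer.consumers.size() != 1) return false;
    const Node* consumer = producer.consumers.front();
    assert(consumer != nullptr);  // this sketch does not allow null edges
    return consumer->type == type;
}
```

The check matters because fusing the `Convert` changes the producer's output precision. If another node also consumed the original output, it would silently receive a different precision after the fusion.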
        const auto& m_fc = pattern_map.at(fc).get_node_shared_ptr();
        const auto& m_convert = pattern_map.at(convert).get_node_shared_ptr();
        auto output_type = m_convert->get_output_element_type(0);
maybe add a check here to make sure convert is the only child of fc node.
> maybe add a check here to make sure convert is the only child of fc node.

@usstq Updated as above.
@@ -25,6 +25,7 @@ class LLMMLPNode : public ov::op::Op {
        int hidden_size;
        int up_size;
        bool gate_up_combined;
        bool tail_f32 = false;
Maybe we can add `ov::element::Type output_type = ov::element::undefined;` instead, to be consistent with `FullyConnectedNode`.
> Maybe we can add `ov::element::Type output_type = ov::element::undefined;` instead, to be consistent with `FullyConnectedNode`.

@usstq I see. Updated the spec.
Force-pushed: db91b52 to 050df44
Force-pushed: 050df44 to 688d26b
@xczhai @usstq I have a couple of concerns regarding the current design:
What are your thoughts on that?
@dmitry-gorokhov
This PR will be closed in a week because of 2 weeks of no activity.
Force-pushed: 9c02940 to 0dba6ff
Details:
To remove the extra `Convert` node introduced by the `ConvertPrecision` pass, this PR optimizes some specific patterns. The detailed passes are listed below.
After that, sdpa/rms/fc/llmmlp can fuse the following `Convert` node and meet the fusion pattern. As a result, performance improves.
fp16 performance is very close to bf16.
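The effect described above, where an op absorbs the `Convert` that follows it, can be sketched on a toy linear chain. This is an illustration only and not the PR's implementation: `Op`, `fuse_trailing_converts`, and the set of fusable type names are hypothetical, and a linear chain implicitly satisfies the single-consumer condition.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Toy op (assumption): a type name and the precision it outputs.
struct Op {
    std::string type;      // e.g. "FC", "Convert"
    std::string out_type;  // e.g. "f16", "f32"
};

// Folds each Convert that directly follows a fusable op into that op,
// so the op itself produces the Convert's output precision.
std::size_t fuse_trailing_converts(std::vector<Op>& chain) {
    static const std::set<std::string> fusable = {"SDPA", "RMS", "FC", "LLMMLP"};
    std::size_t fused = 0;
    for (std::size_t i = 0; i + 1 < chain.size();) {
        if (fusable.count(chain[i].type) != 0 && chain[i + 1].type == "Convert") {
            assert(!chain[i + 1].out_type.empty());  // Convert must name a target type
            chain[i].out_type = chain[i + 1].out_type;
            chain.erase(chain.begin() + static_cast<std::ptrdiff_t>(i + 1));
            ++fused;
        } else {
            ++i;
        }
    }
    return fused;
}
```

After the rewrite, the chain contains one node fewer per fused pair, and the producer carries the output precision that the removed `Convert` used to provide.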
Tickets: