[CPU] [Snippets] Implement Convert for Snippets on ARM #25815
Conversation
Force-pushed 0296171 to 5606d64
@dmitry-gorokhov Hi Dmitry, could you please take a look?
Load conversion fp16 => u8/s8 and store conversion u8/s8 => fp16 are not supported. Is it expected?
namespace {

const std::vector<std::pair<std::vector<ov::element::Type>, std::vector<ov::element::Type>>> types_Convert = {
It can be minor if the used functions are covered by other conversions. For example, the test case store: i32 => i8/u8 is not covered, but the function cvt_i32_to_byte is used in other cases (actually, I'm not sure about all function input arguments). Please make sure that the used functions are covered.
Not covered conversions:
- store: i32 => i8/u8
- store: i32 => f16
- store: f32 => i32
- store: i32 => f32
- load: i8/u8 => i32
- load: f16 => i32
- load: i32 => f32
- load: f32 => i32
Thanks Edward for the comment! In fact, I added these i32-related test cases during the implementation. Then I removed them to align the tokenization behavior with x64, described here: https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_cpu/src/transformations/transformation_pipeline.cpp#L1043.
cvt_byte_to_i32<isa>(h, in_idxs, out_idxs, input_type.is_signed());
cvt_i32_to_f32<isa>(h, out_idxs, out_idxs);
cvt_f32_to_f16<isa>(h, out_idxs, out_idxs);
Suggestion: is the i8/u8 => f16 conversion correct? Can we simplify here to just two instructions? Note, please, that this source code is not tested. If yes, what about other similar cases?
For signed int8:
sshll(out.h8, in.b8, 0); // signed shift left long
scvtf(out.h, out.h); // signed integer convert to half precision floating-point
For unsigned int8:
ushll(out.h8, in.b8, 0); // unsigned shift left long
ucvtf(out.h, out.h); // unsigned integer convert to half precision floating-point
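As a cross-check of the suggested two-instruction sequence, here is a scalar Python model (an illustration only, not the JIT code): sshll/ushll with shift #0 merely widens each 8-bit lane, and scvtf/ucvtf then converts the integer to half precision. Every int8/uint8 value is exactly representable in f16, so the model uses Python's `struct` binary16 format `'e'`:

```python
import struct

def i8_lane_to_f16(x, signed=True):
    # sshll/ushll with shift #0 only widens the 8-bit lane to 16 bits.
    if not signed:
        x &= 0xFF  # reinterpret the byte pattern as unsigned
    # scvtf/ucvtf: integer -> half precision; struct format 'e' is binary16.
    return struct.unpack('<e', struct.pack('<e', float(x)))[0]

print(i8_lane_to_f16(-5))                # -5.0
print(i8_lane_to_f16(-5, signed=False))  # 251.0
```

Since |x| <= 255 < 2048, the f16 rounding step is always exact, which supports the claim that the two-instruction sequence loses no information.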
Good point! Applied. Thanks Edward!
Hi Edward! I find that a single conversion instruction between f16 and i16 is only available for architecture ARMv8.2-A or later. As the CPU ISA asimd we support does not distinguish ARMv8 from ARMv8.2-A, I still use three instructions to convert between f16 and i16.
I met a crash in the conversion SLT on the CI machine ie-tests-linux-ubuntu20_arm64-cpu. With the help of the validation team, I found that this is a Raspberry Pi (model name Cortex-A72) with architecture ARMv8, which does not support a single conversion instruction between f16 and i16. So I still use three instructions to stay compatible with ARMv8 platforms.
Please feel free to have further discussions. Thanks! @eshoguli cc' @dmitry-gorokhov
First part: JIT code is not verified for now
src/plugins/intel_cpu/tests/functional/shared_tests_instances/skip_tests_config.cpp
src/plugins/intel_cpu/tests/functional/shared_tests_instances/skip_tests_config.cpp
@@ -63,7 +63,17 @@ void ConvertCPULayerTest::SetUp() {
    auto primitive = selectedType;
    if (primitive.empty())
        primitive = getPrimitiveType();
    if (!isInOutPrecisionSupported(inPrc, outPrc))
#if defined(OPENVINO_ARCH_ARM64)
There is a method isInOutPrecisionSupported which excludes some test cases for ACL on ARM. Should we remove them if Convert will now be executed via Snippets?
Thanks Alexandra for the comment! You are right, Convert will now be executed via Snippets. Yet tokenization of Snippets has other constraints besides precision. For example, though i8 is generally supported by Snippets, some i8 cases will not be tokenized because of, for example, the rank issue discussed in the next comment. For such cases, we still need isInOutPrecisionSupported to return false for ACL when it does not support the conversion.
And as I added primitive != "jit" to the condition, isInOutPrecisionSupported will not be executed if primitive is already equal to "jit". Please feel free to have further discussions.
src/plugins/intel_cpu/tests/functional/custom/single_layer_tests/classes/conversion.cpp
src/plugins/intel_cpu/src/transformations/snippets/aarch64/pass/snippets_mark_skipped.cpp
src/plugins/intel_cpu/src/emitters/plugin/aarch64/jit_load_store_emitters.hpp
src/plugins/intel_cpu/src/emitters/snippets/aarch64/jit_memory_emitters.cpp
src/plugins/intel_cpu/src/emitters/snippets/aarch64/jit_memory_emitters.cpp
Force-pushed 66ca73f to 605bdb5
Thanks Edward for the comment! If the input precision is not equal to the output precision, then the output precision of Load and the input precision of Store only support f32 and i32. I aligned this behavior with the x64 implementation https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_cpu/src/emitters/plugin/x64/jit_load_store_emitters.cpp#L127. I assume the reason for such support is that the output precision of Load and the input precision of Store are intermediate precisions used to conduct computation in vector registers, and f32 and i32 can maintain accuracy.
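The accuracy argument can be illustrated with a small Python round-trip check (an illustration of the numeric reasoning, not the emitter code): every 8/16-bit integer survives a pass through f32 exactly, while a narrower intermediate such as f16 already rounds small i32 values, and f32 itself loses integers beyond 2^24, which is why i32 is kept as the integer intermediate:

```python
import struct

def through(value, fmt):
    # Pack the integer as a float of the given width and read it back.
    return struct.unpack('<' + fmt, struct.pack('<' + fmt, float(value)))[0]

# f32 (format 'f', 24-bit significand) holds every i8/u8/i16 value exactly...
assert through(255, 'f') == 255.0
assert through(-32768, 'f') == -32768.0
# ...while f16 (format 'e', 11-bit significand) already rounds small i32 values:
assert through(2049, 'e') == 2048.0
# and f32 loses integers beyond 2**24, hence i32 as the integer intermediate:
assert through(2**24 + 1, 'f') == 2.0**24
```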
Force-pushed 605bdb5 to 9229126
src/plugins/intel_cpu/tests/functional/shared_tests_instances/skip_tests_config.cpp
src/plugins/intel_cpu/tests/functional/custom/single_layer_tests/classes/conversion.cpp
src/plugins/intel_cpu/src/transformations/snippets/aarch64/pass/snippets_mark_skipped.cpp
src/plugins/intel_cpu/src/emitters/snippets/aarch64/jit_memory_emitters.cpp
src/plugins/intel_cpu/src/emitters/plugin/aarch64/jit_load_store_emitters.hpp
src/plugins/intel_cpu/src/emitters/plugin/aarch64/jit_load_store_emitters.cpp
case ov::element::i32:
    cvt_f16_to_f32<isa>(h, aux_vec_idxs, aux_vec_idxs);
    cvt_f32_to_i32<isa>(h, aux_vec_idxs, out_idxs);
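As a sanity check of the quoted two-step path, here is a scalar Python model (an illustration only; it assumes cvt_f32_to_i32 truncates toward zero, as the AArch64 fcvtzs instruction does):

```python
import struct

def load_f16_as_i32(h):
    # Step 1, cvt_f16_to_f32: quantize to a half-precision lane (format 'e'),
    # then widen; f16 -> f32 widening is always lossless.
    f16 = struct.unpack('<e', struct.pack('<e', h))[0]
    # Step 2, cvt_f32_to_i32: convert to integer, truncating toward zero
    # (Python int() on a float also truncates toward zero).
    return int(f16)

print(load_f16_as_i32(2.75))   # 2
print(load_f16_as_i32(-2.75))  # -2
```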
One of my main concerns about these emitters is the following: at the moment, jit_load_emitter and jit_store_emitter for different precisions look like Load + Convert or Convert + Store. Then I don't see the sense in the passes FuseLoadConvert and FuseStoreConvert 🤔
On x64 we just use special instructions, e.g. to load 8 packed 8-bit integers as 8 packed 16-bit integers. Because of that, the expression LoadConvert is more efficient than the pair Load + Convert. Does aarch64 have something similar?
If it does not, maybe there is no sense in supporting Load any precision -> f32/i32 and Store i32/f32 -> any precision? If developers need this conversion, they can just call the convert emitter 🤔 And there would be no need to register these optimization passes, which fuse Convert with memory expressions, in the data flow pipeline in Snippets.
On the other hand, we could just call convert_emitter in load/store_emitter for precision conversion to support the generic case. Then users wouldn't need to think about the different precisions of GPR and VEC when they write their own implementations (not Snippets).
But I'd prefer the first variant (as RISC ideology 😄).
What do you think? It's an open question for discussion.
Thanks Alexandra for the comment! I agree that from the perspective of RISC ideology the first variant is better. Yet I vote for your second variant, for the exact reason you provided. Applying the second variant, we align with the same behavior as x64 when calling load/store_emitter. As load/store_emitter will be called in more places in non-Snippets JIT implementations, and in most cases they load from the source precision to f32 (or store from f32 to the destination precision), I believe such an end-to-end second variant will bring more convenience for developers. Please feel free to have further discussions.
@xuchen-intel, emitters for all architectures should be kept as simple as possible. We should not try to align semantics of load/store emitters between x86 and arm.
x86 load/store emitters support intrinsically conversions only because there are hardware instructions that can simultaneously convert&write to memory. If there are no such instructions on arm, we should not try to incorporate convert emitters into load/store emitters.
So my suggestion is to keep load/store without convert functionality on arm. We just should not call the FuseLoadStoreConvert pass on arm, and the appropriate convert emitters will be used automatically. This would allow us to simplify the load/store emitters.
I've thought twice. I think both of you have made good points! And my concern about convenience of usage in non-Snippets implementations can be solved by packing load/store and conversion emitters together locally. So let's do this separation to make load/store as simple as possible!
For now, I've tried to remove the conversion emitter from load/store, but many test case failures show up. I need to investigate the issue, so I created a separate ticket 150430 (cc' @dmitry-gorokhov) to track this task. Thanks @IvanNovoselov and @a-sidorova!
Force-pushed 9229126 to c727fdc
Force-pushed c727fdc to 52d8898
Force-pushed fe1337a to 6eef6da
src/plugins/intel_cpu/tests/functional/shared_tests_instances/snippets/arm/convert.cpp
src/plugins/intel_cpu/src/emitters/plugin/aarch64/jit_load_store_emitters.cpp
switch (src_prc_) {
case ov::element::f32:
case ov::element::i32:
    load_qbyte<isa>(in_idxs, src_prc_ == dst_prc_ ? out_idxs : aux_vec_idxs);
Do we really need to load into aux_vec_idxs in this case? Can convert_emitter handle the same registers on input and output itself?
Absolutely! Double checked: these conversion instructions support in-place computation. Applied. Thanks Alexandra!
Great! Thank you for applying it!
@a-sidorova Hi Alexandra! My previous judgement might be wrong!
I reverted the commit "Remove unnecessary aux_vec_idxs" (28af19b), because it causes failures in the test suite smoke_LSTMCellCommon. I checked some of these cases; they contain f16->f32 and f32->f16 conversions, though such conversion cases are also covered by the test suite smoke_Snippets_Convert, which passes successfully. I believe it is not safe to use an in-place register for these conversion instructions. Besides, the manual does not explicitly say these conversion instructions support in-place operation: https://developer.arm.com/documentation/ddi0602/2023-12/SIMD-FP-Instructions/FCVTL--FCVTL2--Floating-point-Convert-to-higher-precision-Long--vector--?lang=en. Please feel free to have further discussions.
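For reference, here is a scalar Python model of FCVTL's lane layout (an illustration only; it models which bytes are read and written, not whether the hardware permits dst == src): the instruction reads four f16 lanes from the lower 64 bits of the source and produces four f32 lanes filling the whole 128-bit destination, so an in-place form writes over the very lanes it consumed:

```python
import struct

def fcvtl(vreg):
    # vreg: a 16-byte vector; the lower 8 bytes hold four f16 lanes ('e').
    lanes = struct.unpack('<4e', vreg[:8])
    # The result fills all 16 bytes as four f32 lanes ('f'), so writing it
    # back to the source register overwrites the input lanes.
    return struct.pack('<4f', *lanes)

src = struct.pack('<4e8x', 1.5, -2.0, 0.25, 3.0)  # upper 8 bytes are padding
print(struct.unpack('<4f', fcvtl(src)))  # (1.5, -2.0, 0.25, 3.0)
```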
I really believe that it's possible, because I found implementations in ACL which use the same instructions (and the same ISA) and save the result to the same register. So probably there is another bug 🤔
I'd suggest coming back to this question within the scope of ticket 150430 at least.
Agree. Will come back to this question within the scope of ticket 150430, as aux_vec_idxs is meant to be removed once the conversion emitter is separated from the load/store emitters. Thanks Alexandra!
src/plugins/intel_cpu/src/emitters/plugin/aarch64/jit_conversion_emitters.cpp
src/plugins/intel_cpu/src/emitters/plugin/aarch64/jit_conversion_emitters.cpp
src/plugins/intel_cpu/src/emitters/plugin/aarch64/jit_conversion_emitters.cpp
src/plugins/intel_cpu/src/emitters/plugin/aarch64/jit_conversion_emitters.cpp
src/plugins/intel_cpu/src/emitters/plugin/aarch64/jit_conversion_emitters.cpp
I've launched benchmark validation in the waiting queue. Also launched accuracy validation. Thanks @a-sidorova!
…it#25815)
### Details:
- Add jit implementation for Convert emitters on ARM
- Add jit implementation for Load/Store emitters for precisions i32, f16, i8, u8 on ARM
- Add Snippets tokenization for Convert on ARM
- Enable LoadConvertSaturation and three other counterparts
- Test case coverage
### Tickets:
- [CVS-141288](https://jira.devtools.intel.com/browse/CVS-141288)
- [CVS-141294](https://jira.devtools.intel.com/browse/CVS-141294)