
[CPU] [Snippets] Implement Convert for Snippets on ARM #25815

Merged

Conversation

xuchen-intel (Contributor)

Details:

  • Add jit implementation for Convert emitters on ARM
  • Add jit implementation for Load/Store emitters for precisions i32, f16, i8, u8 on ARM
  • Add Snippets tokenization for Convert on ARM
  • Enable LoadConvertSaturation and three other counterparts
  • Test case coverage

Tickets:

  • CVS-141288
  • CVS-141294

@xuchen-intel added the category: CPU (OpenVINO CPU plugin) label on Jul 31, 2024
@xuchen-intel requested review from a team as code owners on Jul 31, 2024 03:40
@github-actions bot added the category: build (OpenVINO cmake script / infra) label on Jul 31, 2024
@xuchen-intel force-pushed the feature/arm_snippets_convert branch 3 times, most recently from 0296171 to 5606d64, on August 5, 2024 02:53
@xuchen-intel changed the title from "[Draft] [CPU] [Snippets] Implement Convert for Snippets on ARM" to "[CPU] [Snippets] Implement Convert for Snippets on ARM" on Aug 5, 2024
@xuchen-intel (Contributor Author):

@dmitry-gorokhov Hi Dmitry, could you please take a look?

@a-sidorova self-assigned this on Aug 5, 2024
@dmitry-gorokhov added this to the 2024.4 milestone on Aug 5, 2024
@eshoguli (Contributor) left a comment:

Load conversion f16 => u8/s8 and store conversion u8/s8 => f16 are not supported. Is this expected?


```cpp
namespace {

const std::vector<std::pair<std::vector<ov::element::Type>, std::vector<ov::element::Type>>> types_Convert = {
```
@eshoguli (Contributor):

This may be minor if the functions used are covered by other conversions. For example, the test case store: i32 => i8 / u8 is not covered, but the function cvt_i32_to_byte is used in other cases (actually, I'm not sure about all function input arguments). Please make sure that every function used is covered.

Not covered conversions:

  • store: i32 => i8 / u8
  • store: i32 => f16
  • store: f32 => i32
  • store: i32 => f32
  • load: i8 / u8 => i32
  • load: f16 => i32
  • load: i32 => f32
  • load: f32 => i32

@xuchen-intel (Contributor Author):

Thanks Edward for the comment! In fact, I added these i32-related test cases during the implementation, then removed them to align the tokenization behavior with x64, as described here: https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_cpu/src/transformations/transformation_pipeline.cpp#L1043.

Comment on lines 78 to 193
```cpp
cvt_byte_to_i32<isa>(h, in_idxs, out_idxs, input_type.is_signed());  // i8/u8 -> i32 (widen)
cvt_i32_to_f32<isa>(h, out_idxs, out_idxs);                          // i32 -> f32
cvt_f32_to_f16<isa>(h, out_idxs, out_idxs);                          // f32 -> f16 (narrow)
```
@eshoguli (Contributor):

Suggestion: is this i8/u8 => f16 conversion correct? Can we simplify it here to just two instructions? Please note that the source code below is not tested. If so, what about other similar cases?

For signed int8:

```cpp
sshll(out.h8, in.b8, 0);  // signed shift left long: widen 8 x i8 -> 8 x i16
scvtf(out.h, out.h);      // signed integer convert to half-precision floating-point
```

For unsigned int8:

```cpp
ushll(out.h8, in.b8, 0);  // unsigned shift left long: widen 8 x u8 -> 8 x u16
ucvtf(out.h, out.h);      // unsigned integer convert to half-precision floating-point
```

@xuchen-intel (Contributor Author):

Good point! Applied. Thanks Edward!

@xuchen-intel (Contributor Author):

Hi Edward! I found that a single-instruction conversion between f16 and i16 is only available on the ARMv8.2-A architecture or later. As the CPU ISA asimd we support does not distinguish ARMv8 from ARMv8.2-A, I still use three instructions to convert between f16 and i16.

I hit a crash in the conversion SLT on the CI machine ie-tests-linux-ubuntu20_arm64-cpu. With the help of the validation team, I found that it is a Raspberry Pi (model name Cortex-A72) with the ARMv8 architecture, which does not support a single-instruction conversion between f16 and i16. So I still use three instructions to stay compatible with ARMv8 platforms.

Please feel free to have further discussions. Thanks! @eshoguli cc' @dmitry-gorokhov
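
For reference, a minimal sketch of the three-instruction path described above, in assumed Xbyak_aarch64-style syntax (register indices, lane arrangements, and the generator handle `h` are illustrative; this is not the PR's actual code):

```cpp
// f16 -> i16 on plain ARMv8, avoiding the single-instruction conversion
// that is only available from ARMv8.2-A onwards.
Xbyak_aarch64::VReg src(0), tmp(1), dst(2);  // hypothetical register choices
h->fcvtl(tmp.s4, src.h4);    // widen: 4 x f16 -> 4 x f32
h->fcvtzs(tmp.s4, tmp.s4);   // convert: f32 -> i32 (round toward zero)
h->sqxtn(dst.h4, tmp.s4);    // narrow with saturation: 4 x i32 -> 4 x i16
```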

@a-sidorova (Contributor) left a comment:

First part: JIT code is not verified for now

src/plugins/intel_cpu/tests/functional/CMakeLists.txt (outdated, resolved)
```diff
@@ -63,7 +63,17 @@ void ConvertCPULayerTest::SetUp() {
    auto primitive = selectedType;
    if (primitive.empty())
        primitive = getPrimitiveType();
    if (!isInOutPrecisionSupported(inPrc, outPrc))
#if defined(OPENVINO_ARCH_ARM64)
```
@a-sidorova (Contributor):

There is a method isInOutPrecisionSupported which excludes some test cases for ACL on ARM. Should we remove them now that Convert will be executed via Snippets?

@xuchen-intel (Contributor Author):

Thanks Alexandra for the comment! You are right that Convert will now be executed via Snippets. Yet Snippets tokenization has other constraints besides precision. For example, though i8 is generally supported by Snippets, some i8 cases will not be tokenized because of, for example, the rank constraint discussed in the next comment. For such cases, we still need isInOutPrecisionSupported to return false for ACL when ACL does not support them.

And since I added primitive != "jit" to the condition, isInOutPrecisionSupported will not be evaluated if primitive is already equal to "jit" (see the sketch below). Please feel free to have further discussions.
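
For illustration, a self-contained sketch of the guard logic described above; the names mirror the test code, but this is an assumed shape rather than the actual diff:

```cpp
#include <string>

// ACL capability check, declared elsewhere in the test utilities.
bool isInOutPrecisionSupported(const std::string& inPrc, const std::string& outPrc);

// Hypothetical helper: once the Snippets JIT primitive is selected, the ACL
// precision check is skipped entirely, since Snippets applies its own
// tokenization constraints (rank, etc.) instead.
bool needsPrecisionFallback(const std::string& primitive,
                            const std::string& inPrc,
                            const std::string& outPrc) {
    return primitive != "jit" && !isInOutPrecisionSupported(inPrc, outPrc);
}
```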

@xuchen-intel force-pushed the feature/arm_snippets_convert branch 2 times, most recently from 66ca73f to 605bdb5, on August 8, 2024 05:15
@xuchen-intel (Contributor Author):

> Load conversion f16 => u8/s8 and store conversion u8/s8 => f16 are not supported. Is this expected?

Thanks Edward for the comment! If the input precision is not equal to the output precision, then the output precision of Load and the input precision of Store only support f32 and i32. I aligned this behavior with the x64 implementation (https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_cpu/src/emitters/plugin/x64/jit_load_store_emitters.cpp#L127). I assume the reason for this restriction is that the output precision of Load and the input precision of Store are intermediate precisions used for computation in vector registers, and f32 and i32 can maintain accuracy.
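
As an illustration of this intermediate-precision idea, a hedged sketch (assumed Xbyak_aarch64-style syntax; not the PR's code) of a Load that widens i8 data to i32 lanes before any computation:

```cpp
// Load 8 x i8 and widen to i32 so that subsequent arithmetic runs in the
// i32 intermediate precision mentioned above. Registers are illustrative.
Xbyak_aarch64::XReg src_ptr(0);
Xbyak_aarch64::VReg v(0);
h->ldr(Xbyak_aarch64::DReg(v.getIdx()), Xbyak_aarch64::ptr(src_ptr));  // 8 bytes
h->sshll(v.h8, v.b8, 0);  // signed widen: i8 -> i16
h->sshll(v.s4, v.h4, 0);  // signed widen: i16 -> i32 (low 4 lanes)
```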

@xuchen-intel force-pushed the feature/arm_snippets_convert branch from 605bdb5 to 9229126 on August 9, 2024 04:38
Comment on lines 167 to 169
```cpp
case ov::element::i32:
    cvt_f16_to_f32<isa>(h, aux_vec_idxs, aux_vec_idxs);  // widen f16 -> f32 in the aux register
    cvt_f32_to_i32<isa>(h, aux_vec_idxs, out_idxs);      // then f32 -> i32 into the output register
```
@a-sidorova (Contributor):

One of my main concerns about these emitters is the following: at the moment, jit_load_emitter and jit_store_emitter for differing precisions look like Load + Convert or Convert + Store. Then I don't see the sense in the passes FuseLoadConvert and FuseStoreConvert 🤔

On x64 we just use special instructions to load 8 packed 8-bit integers as 8 packed 16-bit integers. Because of that, the expression LoadConvert is more efficient than the pair Load + Convert. Does aarch64 have anything similar?

If it does not, maybe there is no sense in supporting Load any precision -> f32/i32 and Store i32/f32 -> any precision? If a developer needs this conversion, they can just call the convert emitter 🤔 Then there is no need to register these optimization passes, which fuse Convert with memory expressions, in the Snippets data flow pipeline.

On the other hand, we could just call convert_emitter inside load/store_emitter for precision conversion to support the generic case. Then users wouldn't need to think about the different precisions of GPR and VEC when they write their own implementations (outside Snippets).

But I'd prefer the first variant (as RISC ideology 😄). What do you think? It's an open question for discussion.
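
To make the contrast concrete, a hedged sketch of the two situations (Xbyak-style syntax on x64 and Xbyak_aarch64-style syntax on aarch64; registers, pointers, and the generator handle `h` are illustrative):

```cpp
// x64: a single instruction both loads 8 packed i8 values from memory and
// sign-extends them to 8 x i16 lanes, so fusing Load + Convert saves work.
h->vpmovsxbw(xmm0, ptr[src_reg]);

// aarch64: there is no combined load+convert form, so the same effect takes
// a plain load followed by a separate widening instruction.
h->ldr(d0, ptr(src_reg));    // load 8 bytes into the vector register
h->sshll(v0.h8, v0.b8, 0);   // widen: i8 -> i16
```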

@xuchen-intel (Contributor Author):

Thanks Alexandra for the comment! I agree that from the perspective of RISC ideology the first variant is better. Yet I vote for your second variant, for the exact reason you provided. Applying the second variant, we align with the behavior of x64 when calling load/store_emitter. As load/store_emitter is called in more places in non-Snippets JIT implementations, and in most cases they load from the source precision to f32 (or store from f32 to the destination precision), I believe this end-to-end second variant will bring more convenience for developers. Please feel free to have further discussions.

@IvanNovoselov (Contributor):

@xuchen-intel, emitters for all architectures should be kept as simple as possible. We should not try to align the semantics of load/store emitters between x86 and ARM. The x86 load/store emitters support conversions intrinsically only because there are hardware instructions that can simultaneously convert and read/write memory. If there are no such instructions on ARM, we should not try to incorporate convert emitters into load/store emitters.

So my suggestion is to keep load/store without convert functionality on ARM. We just should not call the FuseLoadStoreConvert pass on ARM, and the appropriate convert emitters would be used automatically. This would allow us to simplify the load/store emitters.

@xuchen-intel (Contributor Author):

I've thought twice. I think both of you have made good points! And my concern about convenience of usage in non-Snippets implementations can be solved by packing the load/store and conversion emitters together locally. So let's do this separation to make load/store as simple as possible!

For now, I've tried to remove the conversion emitter from load/store, but many test case failures showed up. I need to investigate the issue, so I created a separate ticket, 150430 (cc' @dmitry-gorokhov), to track this task. Thanks @IvanNovoselov and @a-sidorova!

@xuchen-intel force-pushed the feature/arm_snippets_convert branch from 9229126 to c727fdc on August 20, 2024 02:42
@xuchen-intel force-pushed the feature/arm_snippets_convert branch from c727fdc to 52d8898 on August 20, 2024 03:20
@xuchen-intel force-pushed the feature/arm_snippets_convert branch from fe1337a to 6eef6da on August 21, 2024 08:29
```cpp
switch (src_prc_) {
case ov::element::f32:
case ov::element::i32:
    load_qbyte<isa>(in_idxs, src_prc_ == dst_prc_ ? out_idxs : aux_vec_idxs);
```
@a-sidorova (Contributor):

Do we really need to load into aux_vec_idxs in this case? Can convert_emitter handle the same registers on input and output itself?

@xuchen-intel (Contributor Author):

Absolutely! I double-checked: these conversion instructions support in-place computation. Applied. Thanks Alexandra!

@a-sidorova (Contributor):

Great! Thank you for applying it!

@xuchen-intel (Contributor Author):

@a-sidorova Hi Alexandra! My previous judgement might be wrong!

I reverted the commit "Remove unnecessary aux_vec_idxs" (28af19b), because it caused failures in the test suite smoke_LSTMCellCommon. I checked some of these cases; they contain f16->f32 and f32->f16 conversions, although such conversion cases are also covered by the test suite smoke_Snippets_Convert, which passes successfully. I believe it is not safe to use an in-place register for these conversion instructions. Besides, the manual does not explicitly state that these conversion instructions support in-place operation: https://developer.arm.com/documentation/ddi0602/2023-12/SIMD-FP-Instructions/FCVTL--FCVTL2--Floating-point-Convert-to-higher-precision-Long--vector--?lang=en. Please feel free to have further discussions.
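
For clarity, the two register-allocation variants under discussion, in assumed Xbyak_aarch64-style syntax (illustrative, not the PR's code):

```cpp
Xbyak_aarch64::VReg v0(0), v1(1);  // hypothetical register choices
// In-place form (the reverted commit): source and destination coincide.
h->fcvtl(v0.s4, v0.h4);  // f16 -> f32, dst == src
// Form kept for now: widen into a separate aux register.
h->fcvtl(v1.s4, v0.h4);  // f16 -> f32, dst != src
```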

@a-sidorova (Contributor):

I really believe that it's possible, because I found implementations in ACL which use the same instructions (and the same ISA) and save the result to the same register. So probably there is another bug 🤔

I'd suggest coming back to this question within the bounds of ticket 150430, at least.

@xuchen-intel (Contributor Author):

Agree. I will come back to this question within the bounds of ticket 150430, as aux_vec_idxs is meant to be removed once the conversion emitter is separated from the load/store emitters. Thanks Alexandra!

@xuchen-intel (Contributor Author) commented Aug 25, 2024:

> @xuchen-intel may I ask you to launch benchmark validation (infer_precision=f32 and infer_precision=f16) and accuracy validation, please?

I've launched benchmark validation; it is in the waiting queue:
https://ci-dlbenchmark-icv.iotg.sclab.intel.com/job/DL-Benchmark/job/prod/job/WW34-2024.4.0-16419/job/W_ARM_latency_f32_macos13_arm_m2/4/
https://ci-dlbenchmark-icv.iotg.sclab.intel.com/job/DL-Benchmark/job/prod/job/WW34-2024.4.0-16419/job/W_ARM_latency_f32_macos13_arm_m2/5/

Also launched accuracy validation: https://ci-accuracy-icv.iotg.sclab.intel.com/job/OMZ-Validation/job/try/job/macos/1464/

Thanks @a-sidorova!

@a-sidorova The performance result is good: http://benchmarks.sclab.intel.com/index.py?view=dlb-multi-bapp-lat.yml&target_builds=prod/WW34-2024.4.0-16419/W_ARM_latency_f32_macos13_arm_m2/10&reference_builds=prod/WW34-2024.4.0-16419/W_ARM_latency_f32_macos13_arm_m2/11&selected_frameworks=&selected_precisions=
Accuracy validation is still ongoing, which is slow. Another launch of accuracy validation, for reference, is in the waiting queue. Judging from the current progress, it will take 1.5 to 2 more weeks to get the accuracy validation results. Thanks Alexandra!
cc' @dmitry-gorokhov @IvanNovoselov

Update:
Accuracy result is good. Details are attached in ticket 141294.

@ilya-lavrenov added the platform: arm (OpenVINO on ARM / ARM64) label on Aug 29, 2024
@IvanNovoselov added this pull request to the merge queue on Sep 6, 2024
Merged via the queue into openvinotoolkit:master with commit 48a6777 on Sep 6, 2024
149 checks passed
ababushk pushed a commit to ababushk/openvino that referenced this pull request on Sep 11, 2024