
[CPU] [Snippets] Implement Convert for Snippets on ARM #25815

Merged

Conversation

xuchen-intel (Contributor)

Details:

  • Add jit implementation for Convert emitters on ARM
  • Add jit implementation for Load/Store emitters for precisions i32, f16, i8, u8 on ARM
  • Add Snippets tokenization for Convert on ARM
  • Enable LoadConvertSaturation and three other counterparts
  • Test case coverage

Tickets:

  • CVS-141288
  • CVS-141294

@xuchen-intel added the category: CPU (OpenVINO CPU plugin) label on Jul 31, 2024
@xuchen-intel requested review from a team as code owners on Jul 31, 2024 03:40
@github-actions bot added the category: build (OpenVINO cmake script / infra) label on Jul 31, 2024
@xuchen-intel force-pushed the feature/arm_snippets_convert branch 3 times, most recently from 0296171 to 5606d64, on August 5, 2024 02:53
@xuchen-intel changed the title from "[Draft] [CPU] [Snippets] Implement Convert for Snippets on ARM" to "[CPU] [Snippets] Implement Convert for Snippets on ARM" on Aug 5, 2024
@xuchen-intel (Contributor Author):

@dmitry-gorokhov Hi Dmitry, could you please take a look?

@a-sidorova self-assigned this on Aug 5, 2024
@dmitry-gorokhov added this to the 2024.4 milestone on Aug 5, 2024
@eshoguli (Contributor) left a comment:

Load conversion f16 => u8/s8 and store conversion u8/s8 => f16 are not supported. Is this expected?


```cpp
namespace {

const std::vector<std::pair<std::vector<ov::element::Type>, std::vector<ov::element::Type>>> types_Convert = {
```
@eshoguli (Contributor):

This may be minor if the functions used are covered by other conversions. For example, the test case store: i32 => i8 / u8 is not covered, but the function cvt_i32_to_byte is used in other cases (actually, I'm not sure about all function input arguments). Please make sure that every function used is covered.

Not covered conversions:

  • store: i32 => i8 / u8
  • store: i32 => f16
  • store: f32 => i32
  • store: i32 => f32
  • load: i8 / u8 => i32
  • load: f16 => i32
  • load: i32 => f32
  • load: f32 => i32

@xuchen-intel (Contributor Author):

Thanks Edward for the comment! In fact, I added these i32-related test cases during the implementation, then removed them to align the tokenization behavior with x64, as described here: https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_cpu/src/transformations/transformation_pipeline.cpp#L1043.

Comment on lines 78 to 193
```cpp
cvt_byte_to_i32<isa>(h, in_idxs, out_idxs, input_type.is_signed());  // i8/u8 -> i32 (widen)
cvt_i32_to_f32<isa>(h, out_idxs, out_idxs);                          // i32 -> f32
cvt_f32_to_f16<isa>(h, out_idxs, out_idxs);                          // f32 -> f16 (narrow)
```
@eshoguli (Contributor):

Suggestion: is this i8/u8 => f16 conversion correct? Can we simplify it here to just two instructions? Please note that the source code below is not tested. If so, what about other similar cases?

For signed int8:

```cpp
sshll(out.h8, in.b8, 0);  // signed shift left long: widen 8 x i8 -> 8 x i16
scvtf(out.h, out.h);      // signed integer convert to half-precision floating-point
```

For unsigned int8:

```cpp
ushll(out.h8, in.b8, 0);  // unsigned shift left long: widen 8 x u8 -> 8 x u16
ucvtf(out.h, out.h);      // unsigned integer convert to half-precision floating-point
```

@xuchen-intel (Contributor Author):

Good point! Applied. Thanks Edward!

@xuchen-intel (Contributor Author):

Hi Edward! I found that a single-instruction conversion between f16 and i16 is only available on the ARMv8.2-A architecture or later. As the CPU ISA asimd we support does not distinguish ARMv8 from ARMv8.2-A, I still use three instructions to convert between f16 and i16.

I hit a crash in the conversion SLT on the CI machine ie-tests-linux-ubuntu20_arm64-cpu. With the help of the validation team, I found that it is a Raspberry Pi (model name Cortex-A72) with the ARMv8 architecture, which does not support a single-instruction conversion between f16 and i16. So I still use three instructions to stay compatible with ARMv8 platforms.

Please feel free to have further discussions. Thanks! @eshoguli cc' @dmitry-gorokhov
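
For reference, a minimal sketch of the three-instruction path described above, in assumed Xbyak_aarch64-style syntax (register indices, lane arrangements, and the generator handle `h` are illustrative; this is not the PR's actual code):

```cpp
// f16 -> i16 on plain ARMv8, avoiding the single-instruction conversion
// that is only available from ARMv8.2-A onwards.
Xbyak_aarch64::VReg src(0), tmp(1), dst(2);  // hypothetical register choices
h->fcvtl(tmp.s4, src.h4);    // widen: 4 x f16 -> 4 x f32
h->fcvtzs(tmp.s4, tmp.s4);   // convert: f32 -> i32 (round toward zero)
h->sqxtn(dst.h4, tmp.s4);    // narrow with saturation: 4 x i32 -> 4 x i16
```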

@a-sidorova (Contributor) left a comment:

First part: JIT code is not verified for now

src/plugins/intel_cpu/tests/functional/CMakeLists.txt (outdated, resolved)
```diff
@@ -63,7 +63,17 @@ void ConvertCPULayerTest::SetUp() {
    auto primitive = selectedType;
    if (primitive.empty())
        primitive = getPrimitiveType();
    if (!isInOutPrecisionSupported(inPrc, outPrc))
#if defined(OPENVINO_ARCH_ARM64)
```
@a-sidorova (Contributor):

There is a method isInOutPrecisionSupported which excludes some test cases for ACL on ARM. Should we remove them now that Convert will be executed via Snippets?

@xuchen-intel (Contributor Author):

Thanks Alexandra for the comment! You are right that Convert will now be executed via Snippets. Yet Snippets tokenization has other constraints besides precision. For example, though i8 is generally supported by Snippets, some i8 cases will not be tokenized because of, for example, the rank constraint discussed in the next comment. For such cases, we still need isInOutPrecisionSupported to return false for ACL when ACL does not support them.

And since I added primitive != "jit" to the condition, isInOutPrecisionSupported will not be evaluated if primitive is already equal to "jit" (see the sketch below). Please feel free to have further discussions.
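
For illustration, a self-contained sketch of the guard logic described above; the names mirror the test code, but this is an assumed shape rather than the actual diff:

```cpp
#include <string>

// ACL capability check, declared elsewhere in the test utilities.
bool isInOutPrecisionSupported(const std::string& inPrc, const std::string& outPrc);

// Hypothetical helper: once the Snippets JIT primitive is selected, the ACL
// precision check is skipped entirely, since Snippets applies its own
// tokenization constraints (rank, etc.) instead.
bool needsPrecisionFallback(const std::string& primitive,
                            const std::string& inPrc,
                            const std::string& outPrc) {
    return primitive != "jit" && !isInOutPrecisionSupported(inPrc, outPrc);
}
```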

@xuchen-intel force-pushed the feature/arm_snippets_convert branch 2 times, most recently from 66ca73f to 605bdb5, on August 8, 2024 05:15
@xuchen-intel (Contributor Author):

> Load conversion f16 => u8/s8 and store conversion u8/s8 => f16 are not supported. Is this expected?

Thanks Edward for the comment! If the input precision is not equal to the output precision, then the output precision of Load and the input precision of Store only support f32 and i32. I aligned this behavior with the x64 implementation (https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_cpu/src/emitters/plugin/x64/jit_load_store_emitters.cpp#L127). I assume the reason for this restriction is that the output precision of Load and the input precision of Store are intermediate precisions used for computation in vector registers, and f32 and i32 can maintain accuracy.
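
As an illustration of this intermediate-precision idea, a hedged sketch (assumed Xbyak_aarch64-style syntax; not the PR's code) of a Load that widens i8 data to i32 lanes before any computation:

```cpp
// Load 8 x i8 and widen to i32 so that subsequent arithmetic runs in the
// i32 intermediate precision mentioned above. Registers are illustrative.
Xbyak_aarch64::XReg src_ptr(0);
Xbyak_aarch64::VReg v(0);
h->ldr(Xbyak_aarch64::DReg(v.getIdx()), Xbyak_aarch64::ptr(src_ptr));  // 8 bytes
h->sshll(v.h8, v.b8, 0);  // signed widen: i8 -> i16
h->sshll(v.s4, v.h4, 0);  // signed widen: i16 -> i32 (low 4 lanes)
```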

@xuchen-intel force-pushed the feature/arm_snippets_convert branch from 605bdb5 to 9229126 on August 9, 2024 04:38
Comment on lines 167 to 169
```cpp
case ov::element::i32:
    cvt_f16_to_f32<isa>(h, aux_vec_idxs, aux_vec_idxs);  // widen f16 -> f32 in the aux register
    cvt_f32_to_i32<isa>(h, aux_vec_idxs, out_idxs);      // then f32 -> i32 into the output register
```
@a-sidorova (Contributor):

One of my main concerns about these emitters is the following: at the moment, jit_load_emitter and jit_store_emitter for differing precisions look like Load + Convert or Convert + Store. Then I don't see the sense in the passes FuseLoadConvert and FuseStoreConvert 🤔

On x64 we just use special instructions to load 8 packed 8-bit integers as 8 packed 16-bit integers. Because of that, the expression LoadConvert is more efficient than the pair Load + Convert. Does aarch64 have anything similar?

If it does not, maybe there is no sense in supporting Load any precision -> f32/i32 and Store i32/f32 -> any precision? If a developer needs this conversion, they can just call the convert emitter 🤔 Then there is no need to register these optimization passes, which fuse Convert with memory expressions, in the Snippets data flow pipeline.

On the other hand, we could just call convert_emitter inside load/store_emitter for precision conversion to support the generic case. Then users wouldn't need to think about the different precisions of GPR and VEC when they write their own implementations (outside Snippets).

But I'd prefer the first variant (as RISC ideology 😄). What do you think? It's an open question for discussion.
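
To make the contrast concrete, a hedged sketch of the two situations (Xbyak-style syntax on x64 and Xbyak_aarch64-style syntax on aarch64; registers, pointers, and the generator handle `h` are illustrative):

```cpp
// x64: a single instruction both loads 8 packed i8 values from memory and
// sign-extends them to 8 x i16 lanes, so fusing Load + Convert saves work.
h->vpmovsxbw(xmm0, ptr[src_reg]);

// aarch64: there is no combined load+convert form, so the same effect takes
// a plain load followed by a separate widening instruction.
h->ldr(d0, ptr(src_reg));    // load 8 bytes into the vector register
h->sshll(v0.h8, v0.b8, 0);   // widen: i8 -> i16
```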

@xuchen-intel (Contributor Author):

Thanks Alexandra for the comment! I agree that from the perspective of RISC ideology the first variant is better. Yet I vote for your second variant, for the exact reason you provided. Applying the second variant, we align with the behavior of x64 when calling load/store_emitter. As load/store_emitter is called in more places in non-Snippets JIT implementations, and in most cases they load from the source precision to f32 (or store from f32 to the destination precision), I believe this end-to-end second variant will bring more convenience for developers. Please feel free to have further discussions.

@IvanNovoselov (Contributor):

@xuchen-intel, emitters for all architectures should be kept as simple as possible. We should not try to align the semantics of load/store emitters between x86 and ARM. The x86 load/store emitters support conversions intrinsically only because there are hardware instructions that can simultaneously convert and read/write memory. If there are no such instructions on ARM, we should not try to incorporate convert emitters into load/store emitters.

So my suggestion is to keep load/store without convert functionality on ARM. We just should not call the FuseLoadStoreConvert pass on ARM, and the appropriate convert emitters would be used automatically. This would allow us to simplify the load/store emitters.

@xuchen-intel (Contributor Author):

I've thought twice. I think both of you have made good points! And my concern about convenience of usage in non-Snippets implementations can be solved by packing the load/store and conversion emitters together locally. So let's do this separation to make load/store as simple as possible!

For now, I've tried to remove the conversion emitter from load/store, but many test case failures showed up. I need to investigate the issue, so I created a separate ticket, 150430 (cc' @dmitry-gorokhov), to track this task. Thanks @IvanNovoselov and @a-sidorova!

@xuchen-intel force-pushed the feature/arm_snippets_convert branch from 9229126 to c727fdc on August 20, 2024 02:42
@xuchen-intel force-pushed the feature/arm_snippets_convert branch from c727fdc to 52d8898 on August 20, 2024 03:20
@xuchen-intel force-pushed the feature/arm_snippets_convert branch from fe1337a to 6eef6da on August 21, 2024 08:29
```cpp
switch (src_prc_) {
case ov::element::f32:
case ov::element::i32:
    load_qbyte<isa>(in_idxs, src_prc_ == dst_prc_ ? out_idxs : aux_vec_idxs);
```
@a-sidorova (Contributor):

Do we really need to load into aux_vec_idxs in this case? Can convert_emitter handle the same registers on input and output itself?

@xuchen-intel (Contributor Author):

Absolutely! I double-checked: these conversion instructions support in-place computation. Applied. Thanks Alexandra!

@a-sidorova (Contributor):

Great! Thank you for applying it!

@xuchen-intel (Contributor Author):

@a-sidorova Hi Alexandra! My previous judgement might be wrong!

I reverted the commit "Remove unnecessary aux_vec_idxs" (28af19b), because it caused failures in the test suite smoke_LSTMCellCommon. I checked some of these cases; they contain f16->f32 and f32->f16 conversions, although such conversion cases are also covered by the test suite smoke_Snippets_Convert, which passes successfully. I believe it is not safe to use an in-place register for these conversion instructions. Besides, the manual does not explicitly state that these conversion instructions support in-place operation: https://developer.arm.com/documentation/ddi0602/2023-12/SIMD-FP-Instructions/FCVTL--FCVTL2--Floating-point-Convert-to-higher-precision-Long--vector--?lang=en. Please feel free to have further discussions.
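
For clarity, the two register-allocation variants under discussion, in assumed Xbyak_aarch64-style syntax (illustrative, not the PR's code):

```cpp
Xbyak_aarch64::VReg v0(0), v1(1);  // hypothetical register choices
// In-place form (the reverted commit): source and destination coincide.
h->fcvtl(v0.s4, v0.h4);  // f16 -> f32, dst == src
// Form kept for now: widen into a separate aux register.
h->fcvtl(v1.s4, v0.h4);  // f16 -> f32, dst != src
```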

@a-sidorova (Contributor):

I really believe that it's possible, because I found implementations in ACL which use the same instructions (and the same ISA) and save the result to the same register. So probably there is another bug 🤔

I'd suggest coming back to this question within the bounds of ticket 150430, at least.

@xuchen-intel (Contributor Author):

Agree. I will come back to this question within the bounds of ticket 150430, as aux_vec_idxs is meant to be removed once the conversion emitter is separated from the load/store emitters. Thanks Alexandra!

@xuchen-intel (Contributor Author) commented Aug 25, 2024:

> @xuchen-intel may I ask you to launch benchmark validation (infer_precision=f32 and infer_precision=f16) and accuracy validation, please?

I've launched benchmark validation; it is in the waiting queue:
https://ci-dlbenchmark-icv.iotg.sclab.intel.com/job/DL-Benchmark/job/prod/job/WW34-2024.4.0-16419/job/W_ARM_latency_f32_macos13_arm_m2/4/
https://ci-dlbenchmark-icv.iotg.sclab.intel.com/job/DL-Benchmark/job/prod/job/WW34-2024.4.0-16419/job/W_ARM_latency_f32_macos13_arm_m2/5/

Also launched accuracy validation: https://ci-accuracy-icv.iotg.sclab.intel.com/job/OMZ-Validation/job/try/job/macos/1464/

Thanks @a-sidorova!

@a-sidorova The performance result is good: http://benchmarks.sclab.intel.com/index.py?view=dlb-multi-bapp-lat.yml&target_builds=prod/WW34-2024.4.0-16419/W_ARM_latency_f32_macos13_arm_m2/10&reference_builds=prod/WW34-2024.4.0-16419/W_ARM_latency_f32_macos13_arm_m2/11&selected_frameworks=&selected_precisions=
Accuracy validation is still ongoing, which is slow. Another launch of accuracy validation, for reference, is in the waiting queue. Judging from the current progress, it will take 1.5 to 2 more weeks to get the accuracy validation results. Thanks Alexandra!
cc' @dmitry-gorokhov @IvanNovoselov

Update:
Accuracy result is good. Details are attached in ticket 141294.

@ilya-lavrenov added the platform: arm (OpenVINO on ARM / ARM64) label on Aug 29, 2024
@IvanNovoselov added this pull request to the merge queue on Sep 6, 2024
Merged via the queue into openvinotoolkit:master with commit 48a6777 on Sep 6, 2024
149 checks passed
ababushk pushed a commit to ababushk/openvino that referenced this pull request on Sep 11, 2024