Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ARM] [SDPA] SVE implementation of MHASingleToken for FP32 #27273

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

NishantPrabhuFujitsu
Copy link

@NishantPrabhuFujitsu NishantPrabhuFujitsu commented Oct 28, 2024

Details:

  • Adds SVE FP32 implementations for functions called during execution of MHASingleToken for SVE-128, SVE-256 and SVE-512 platforms.
  • SVE implementations are compiled only if runtime support for SVE is detected on the hardware, otherwise it falls back to Neon.
  • Adds a new implementation for exponential function exp_ps_<isa> using fewer FMA operations. Executes ~18% faster and has better output precision.

Note: I am aware of the Neon FP16 implementation of SDPA added recently. To accommodate for this, the current SVE changes will be used only if the hardware does not have ARM FP16 support. I will follow up with SVE FP16 implementations soon.

[SVE] Benchmarking results

Below are the benchmarking results of execution time of each ported function. Measurements were performed by running each function individually on dummy inputs (128 fp32 elements) for 1,000,000 iterations and computing average time (in micro-seconds).

image

Execution time of MHASingleToken as a whole was also measured for two LLMs, the results of which are shown below. For LlaMA-3-8B, the SVE-128 and SVE-512 systems at my disposal did not have enough memory, so only SVE-256 results are shown. While there is an improvement overall, these results could be contaminated with run-to-run variation due to the small execution time of the kernel.

Benchmarking details: Prompt length of 108 tokens was used; total time for generating 50 tokens was measured and average execution time was computed.

image

New exponential implementation

It is based on the discussion in these slides (this is based on a past talk in Fujitsu hence the document is in Japanese, sorry!). The algorithm followed is slightly different from the current implementation, in that it uses fexpa instruction available on ARM and requires only 3 Taylor expansion terms (2 FMA operations) to be precise until the 8th decimal place.

Our benchmarking results showed this implementation to be 44%-58% faster than the existing Neon implementation. It is ~18% faster than the SVE implementation of the current algorithm in Neon.

image

In this PR, the new implementation is called by default. The SVE port of the existing Neon implementation has also been retained, if needed.

@NishantPrabhuFujitsu NishantPrabhuFujitsu requested review from a team as code owners October 28, 2024 11:26
@NishantPrabhuFujitsu NishantPrabhuFujitsu requested review from ilya-lavrenov and removed request for a team October 28, 2024 11:26
@github-actions github-actions bot added category: CPU OpenVINO CPU plugin category: dependency_changes Pull requests that update a dependency file no-match-files category: NPU OpenVINO NPU plugin labels Oct 28, 2024
@sys-openvino-ci sys-openvino-ci added the ExternalPR External contributor label Oct 28, 2024
@dmitry-gorokhov dmitry-gorokhov self-assigned this Oct 28, 2024
@github-actions github-actions bot added the category: build OpenVINO cmake script / infra label Oct 28, 2024
@ilya-lavrenov ilya-lavrenov added the platform: arm OpenVINO on ARM / ARM64 label Oct 28, 2024
.gitignore Outdated Show resolved Hide resolved
.gitmodules Outdated
@@ -87,3 +87,6 @@
[submodule "src/plugins/intel_cpu/thirdparty/shl"]
path = src/plugins/intel_cpu/thirdparty/shl
url = https://github.com/openvinotoolkit/shl.git
[submodule "thirdparty/src/plugins/intel_npu/thirdparty/level-zero"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove all submodule changes from the PR.
image

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have (tried my best to) clean this up. Please let me know if it's not reverted yet.

If the changes haven't reverted, please guide me on how I can fix it. Sorry and thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Submodule changes are still there. I just pushed the commit that reverts all unnecessery changes e2d5f11. Please apply it on top of your branch.
Please don't include any submodule changes in the commits. I would recommend to call in your working folder to have actual state.

git submodule init
git submodule update

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dmitry-gorokhov I have applied your commit as a patch and now the changes seem to have reverted. Please take a look and let me know.

Meanwhile, I was trying to rebase my branch with master but I receive merge conflicts like so (conflict on the whole folder?):

image

Could you please help?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use "git merge ... " instead of "git rebase ..." to pick up latest master changes.
Or squash all 10 commits from your branch into single one and call "git rebase ..." after that

@@ -246,6 +249,79 @@ static constexpr size_t vec_len_f16_neon = vec_len_neon / sizeof(ov::float16);
#endif

#ifdef OPENVINO_ARCH_ARM64
#if defined(__ARM_FEATURE_SVE) && !defined(__ARM_FEATURE_FP16_VECTOR_ARITHMETIC)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Could you please clarify why we need !defined(__ARM_FEATURE_FP16_VECTOR_ARITHMETIC) check here?
  2. Lets use HAVE_SVE instead of __ARM_FEATURE_SVE

Copy link
Author

@NishantPrabhuFujitsu NishantPrabhuFujitsu Oct 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. It was a hotfix I had added to silence some errors when testing out my changes initially. They are no longer needed, so I have removed them in the latest commit.
  2. Updated in the latest commit.

@NishantPrabhuFujitsu NishantPrabhuFujitsu requested review from a team as code owners October 29, 2024 11:12
@github-actions github-actions bot added category: inference OpenVINO Runtime library - Inference category: GPU OpenVINO GPU plugin category: Python API OpenVINO Python bindings and removed no-match-files labels Oct 29, 2024
# elif defined(__aarch64__) && defined(__APPLE__)
int64_t result(0);
size_t size = sizeof(result);
const std::string& cap = "hw.optional.sve";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose it's just a placeholder? I don't see such HW capability on macOS

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked for current or upcoming SVE support on Apple Metal, but couldn't find any. To ensure I don't miss it, I checked on Perplexity and this is what it suggested. I'd like your suggestion on whether it should be kept or removed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is ok to return false for Apple so far. We can consider SVE/SME support on M4 as separate activity.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed to return "false" in latest commit.

@dmitry-gorokhov
Copy link
Contributor

build_jenkins

@dmitry-gorokhov
Copy link
Contributor

@NishantPrabhuFujitsu The PR looks good! Will merge once CI is green.

@dmitry-gorokhov dmitry-gorokhov added this to the 2025.0 milestone Dec 11, 2024
@NishantPrabhuFujitsu NishantPrabhuFujitsu force-pushed the mha-single-token-arm-sve-f32 branch from 8feb109 to 01faee6 Compare December 11, 2024 11:16
@dmitry-gorokhov
Copy link
Contributor

build_jenkins

@NishantPrabhuFujitsu NishantPrabhuFujitsu force-pushed the mha-single-token-arm-sve-f32 branch from be60fbc to a81bf1e Compare December 12, 2024 11:24
@dmitry-gorokhov
Copy link
Contributor

build_jenkins

@dmitry-gorokhov
Copy link
Contributor

@NishantPrabhuFujitsu Now all tests pass well! The only remaining thing is to fix code-style job: https://github.com/openvinotoolkit/openvino/actions/runs/12295407288/job/34312381372?pr=27273.
Here is an info how to do it locally (apply clang-format): https://github.com/openvinotoolkit/openvino/blob/master/docs/dev/coding_style.md

@NishantPrabhuFujitsu NishantPrabhuFujitsu force-pushed the mha-single-token-arm-sve-f32 branch from 9dca018 to 1f13550 Compare December 13, 2024 10:04
@NishantPrabhuFujitsu
Copy link
Author

NishantPrabhuFujitsu commented Dec 13, 2024

@dmitry-gorokhov I've followed the instructions in the documentation you shared and a lot of files had indentation-like changes in the end. I've not used this before so I'm not 100% sure if the changes have come in correctly (~70 files had changes). I'll wait to see if the pipelines are still succeeding.

@dmitry-gorokhov
Copy link
Contributor

dmitry-gorokhov commented Dec 13, 2024

@dmitry-gorokhov I've followed the instructions in the documentation you shared and a lot of files had indentation-like changes in the end. I've not used this before so I'm not 100% sure if the changes have come in correctly (~70 files had changes). I'll wait to see if the pipelines are still succeeding.

Yeah, that is expected. We haven't applied clang-format for ARM files yet.
Code style job passes now.

@dmitry-gorokhov
Copy link
Contributor

build_jenkins

OV_CASE(Algorithm::EltwiseIsInf, jit_is_inf_emitter),
OV_CASE(Algorithm::EltwiseIsNaN, jit_is_nan_emitter),
OV_CASE(Algorithm::EltwiseLessEqual, jit_less_equal_emitter),
OV_CASE(Algorithm::EltwiseLogicalAnd, jit_logical_and_emitter),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some reason EltwiseLogicalOr is deleted from this list. It leads to failed tests. Please return it back.

Copy link
Author

@NishantPrabhuFujitsu NishantPrabhuFujitsu Dec 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what caused this. It's not the style check because after adding it back and running the style check, it didn't get removed. However I did notice it disappear momentarily and reappear when I was rebasing my branch after syncing master. Anyway, it should be there this time.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, this indeed is happening due to clang-format. When my colleague built #27841 with clang_format_fix_all, EltwiseOr disappeared from both places. Not sure about the reason for this behavior though.

OV_CASE(Algorithm::EltwiseIsInf, ov::intel_cpu::aarch64::jit_is_inf_emitter),
OV_CASE(Algorithm::EltwiseLessEqual, ov::intel_cpu::aarch64::jit_less_equal_emitter),
OV_CASE(Algorithm::EltwiseLogicalAnd, ov::intel_cpu::aarch64::jit_logical_and_emitter),
OV_CASE(Algorithm::EltwiseLogicalNot, ov::intel_cpu::aarch64::jit_logical_not_emitter),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some reason EltwiseLogicalOr is deleted from this list. It leads to failed tests. Please return it back.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved in latest commit.

@NishantPrabhuFujitsu NishantPrabhuFujitsu force-pushed the mha-single-token-arm-sve-f32 branch from 13d31e1 to 1407cfb Compare December 13, 2024 14:34
@dmitry-gorokhov
Copy link
Contributor

build_jenkins

@NishantPrabhuFujitsu
Copy link
Author

NishantPrabhuFujitsu commented Dec 16, 2024

https://github.com/openvinotoolkit/openvino/actions/runs/12317566674/job/34451409053?pr=27273#step:12:104

Any idea why this is happening? I didn't make any changes to os_flags.cmake in the last few commits. Same error in one of the required checks too.

@dmitry-gorokhov
Copy link
Contributor

https://github.com/openvinotoolkit/openvino/actions/runs/12317566674/job/34451409053?pr=27273#step:12:104

Any idea why this is happening? I didn't make any changes to os_flags.cmake in the last few commits. Same error in one of the required checks too.

Looks like CI issue. I restarted the job.

@dmitry-gorokhov dmitry-gorokhov added this pull request to the merge queue Dec 16, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Dec 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
category: build OpenVINO cmake script / infra category: CPU OpenVINO CPU plugin category: inference OpenVINO Runtime library - Inference ExternalPR External contributor platform: arm OpenVINO on ARM / ARM64
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants