This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
[ARM] [SDPA] SVE implementation of MHASingleToken for FP32 (#27273)
### Details:
- Adds SVE FP32 implementations of the functions called during execution of `MHASingleToken` for SVE-128, SVE-256 and SVE-512 platforms.
- The SVE implementations are compiled only if runtime support for SVE is detected on the hardware; otherwise execution falls back to Neon.
- Adds a new implementation of the exponential function, `exp_ps_<isa>`, that uses fewer FMA operations. It executes ~18% faster and has better output precision.

**Note:** I am aware of the Neon FP16 implementation of SDPA added recently. To accommodate it, the current SVE changes are used only if the hardware does not have ARM FP16 support. I will follow up with SVE FP16 implementations soon.

### [SVE] Benchmarking results
Below are the execution times of each ported function. Each function was run individually on dummy inputs (128 fp32 elements) for 1,000,000 iterations, and the average time (in microseconds) was computed.

![image](https://github.com/user-attachments/assets/3f82238f-af7e-4b68-b4b1-259cf389e41a)

The execution time of `MHASingleToken` as a whole was also measured for two LLMs; the results are shown below. For Llama-3-8B, the SVE-128 and SVE-512 systems at my disposal did not have enough memory, so only SVE-256 results are shown. While there is an overall improvement, these results could be contaminated by run-to-run variation because the kernel's execution time is small.

**Benchmarking details:** A prompt length of 108 tokens was used; the total time for generating 50 tokens was measured and the average execution time was computed.

![image](https://github.com/user-attachments/assets/893c1c46-085f-46af-ab5a-2c1481c75f68)

### New exponential implementation
It is based on the discussion in [these slides](https://www.slideshare.net/slideshow/hpc-phys20201203/239717194#23) (they come from a past talk at Fujitsu, hence the document is in Japanese, sorry!).

The algorithm differs slightly from the current implementation: it uses the `fexpa` instruction available on ARM and requires only 3 Taylor expansion terms (2 FMA operations) to be accurate to the 8th decimal place. Our benchmarking results showed this implementation to be 44%-58% faster than the existing Neon implementation. It is ~18% faster than an SVE port of the algorithm currently used in the Neon implementation.

![image](https://github.com/user-attachments/assets/117df21d-3977-499c-8ab8-8f4346286113)

In this PR, the new implementation is called by default. The SVE port of the existing Neon implementation has also been retained in case it is needed.
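
To illustrate the idea behind the new exponential, here is a minimal scalar sketch in portable C++. It is not the code from this PR: the names `fexpa_emulated` and `exp_sketch` are hypothetical, the `fexpa` table lookup is emulated with `std::exp2` and `std::ldexp`, and the reduction to steps of ln(2)/64 is an assumption based on the 64-entry table that `fexpa` provides. It shows why a 3-term Taylor polynomial (2 FMAs) can suffice once the remainder is small.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical scalar stand-in for the SVE `fexpa` instruction: returns 2^(n/64).
// The hardware reads the fractional part from a 64-entry table; here it is
// computed with std::exp2 purely for illustration.
inline float fexpa_emulated(int32_t n) {
    const int32_t idx = ((n % 64) + 64) % 64;  // table index in [0, 63]
    const int32_t e = (n - idx) / 64;          // power-of-two exponent (floor of n/64)
    const float frac = std::exp2(static_cast<float>(idx) / 64.0f);
    return std::ldexp(frac, e);                // frac * 2^e
}

// Sketch of exp(x) = 2^(n/64) * exp(r), with r = x - n*ln(2)/64.
// Assumes moderate |x|; there is no overflow, underflow, or NaN handling.
inline float exp_sketch(float x) {
    constexpr float k64OverLn2 = 92.33248261689366f;     // 64 / ln(2)
    constexpr float kLn2Over64 = 0.010830424696249145f;  // ln(2) / 64

    const float nf = std::nearbyint(x * k64OverLn2);     // nearest integer n
    const float r = std::fma(-nf, kLn2Over64, x);        // |r| is about ln(2)/128 at most

    // Three Taylor terms, 1 + r + r^2/2, evaluated with two FMAs.
    const float p = std::fma(std::fma(r, 0.5f, 1.0f), r, 1.0f);

    return fexpa_emulated(static_cast<int32_t>(nf)) * p;
}
```

Because |r| stays near or below ln(2)/128 (about 0.0054), the first omitted Taylor term, r^3/6, is on the order of 1e-8 relative to the result, which is consistent with the accuracy described above. A vectorized version would presumably apply the same steps per lane using SVE intrinsics.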
1 parent 44f7ddb, commit b543d0b
Showing 10 changed files with 437 additions and 9 deletions.