[Feature Request]: Optimize PagedAttention operation on aarch64 HW #26422
Comments
@dmitry-gorokhov, can we add
Hi @dmitry-gorokhov @rkazants @wenjiew, is anyone working on this issue? If not, can I take it up?
It's yours now, have fun :)
Hi, currently the paged attention executor logic depends on brgemm (which only supports x86-64). To extend support to ARM64 (aarch64), which library should we use? The prospects are: Arm Compute Library (ACL), OpenBLAS, Eigen, TFLite.
Hi @samkitshah1262. oneDNN has a brgemm block implemented for aarch64. The problem is that it supports only SVE512/SVE256.
@dmitry-gorokhov Implementation of PA on ARM using NEON and ACL is done, except for the MHAHelper class, as the ACL GEMM kernel does not support strides (yet). Could you please review the current changes and provide direction? Thanks!
Request Description
The PagedAttention operation is already implemented within the CPU plugin in C++ and optimized for x64 using AVX2/AVX-512 intrinsics.
The request is to optimize the PA operation for aarch64 using NEON/SVE extensions.
Please refer to the existing SDPA optimization using NEON as a reference.
How to build OV on ARM: https://github.com/openvinotoolkit/openvino/blob/master/docs/dev/build.md
Feature Use Case
The PagedAttention operation implements the attention algorithm required for workloads like continuous batching and speculative decoding. PagedAttention is used as the basic attention block in the vLLM OpenVINO backend and under the OpenVINO GenAI API (for some use cases). The PA operation can consume significant execution resources (especially for long contexts), so its optimization is crucial for overall LLM-based workloads.