Aarch64 paged attention enablement #27841
Conversation
Force-pushed from d886f40 to 3bcc293.

build_jenkins
Hi @alvoron, thanks for looking into this. Yes, this PR has the latest changes. To make it work, I made the following modification to file_path:
Add the first two lines before ov_compiled in link. The vLLM command is correct. Could you please confirm whether you have made the above change to vLLM?
@ashwins990 Some CI jobs have failed (e.g. https://github.com/openvinotoolkit/openvino/actions/runs/12158176037/job/33968356475?pr=27841). Could you please take a look?
@ashwins990 I implemented a couple of fixes to support f16. Could you please cherry-pick these changes and try fp16 inference again?
bool is_bf16 = inType == ov::element::bf16;
M_blk = matmulOptimalM;
M_tail = M % M_blk;
brgVnniFactor = 4 / inType.size();
VNNI naming is not applicable for ARM, I would say. Can we use something like "kBlkStep"?
Changed to "kBlkStep". Thanks.
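For context, a minimal sketch of what the renamed blocking step could look like; the struct name and members here are abbreviated placeholders, not the PR's actual class:

```cpp
#include <cstddef>
#include <openvino/core/type/element_type.hpp>

// Sketch only: illustrates the K-blocking step that replaces the x64
// "VNNI factor" naming. The value is the number of K elements packed
// together, derived from the element byte size (bf16 -> 2, int8 -> 4).
struct BrgemmKernelParams {
    ov::element::Type inType = ov::element::f32;
    std::size_t kBlkStep = 0;

    explicit BrgemmKernelParams(ov::element::Type type) : inType(type) {
        kBlkStep = 4 / inType.size();
    }
};
```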
b_transposed(b_transposed),
inType(inType) {
// blocking M
bool is_bf16 = inType == ov::element::bf16;
Do you have plans to enable it for fp16 as well? The default inference precision on ARM is fp16, so to get better performance we will need to extend the code with fp16 support.
BF16 is not used by OV on ARM as of now.
Currently, the Brgemm kernel does not support f16. We are planning to support an int8 brgemm kernel, and once it's available I can integrate it here. For now, should I remove this check?
For the f16 path, it can go through the ACL GEMM, I guess. I am checking the same. Thanks.
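As a purely hypothetical sketch of the fallback discussed here (not the PR's final code), the precision dispatch could look roughly like this, with f16 routed to the ACL GEMM executor until an f16-capable brgemm kernel is available:

```cpp
#include <openvino/core/type/element_type.hpp>

// Hypothetical dispatch: keep brgemm for the precisions it already handles
// and fall back to the ACL GEMM path for f16.
inline bool use_brgemm(const ov::element::Type& inType) {
    if (inType == ov::element::f16) {
        return false;  // handled by the ACL GEMM executor instead
    }
    return inType == ov::element::f32 || inType == ov::element::bf16;
}
```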
const size_t get_mblk_size() const {
    return matmulOptimalM;
}
const size_t get_k_blk() const {
    return K_blk;
}
const size_t get_wsp_size() const {
    return 4 * 1024;
}
Unused. Please remove
get_wsp_size() is used in code. I have removed the other two. Thanks
size_t M = 0, N = 0, K = 0, LDA = 0, LDB = 0, LDC = 0;
dnnl_data_type_t dt_in0 = dnnl_data_type_undef;
dnnl_data_type_t dt_in1 = dnnl_data_type_undef;
char palette[64];
palette is x64 AMX specific. Can be removed.
Removed. Thanks
dnnl_data_type_t dt_in0 = dnnl_data_type_undef;
dnnl_data_type_t dt_in1 = dnnl_data_type_undef;
char palette[64];
bool is_with_comp = false;
Not applicable for ARM. Please remove
Removed
namespace Cpu {
namespace XARCH {

#define prefetch_bytes(bytes, sel, advance, src)
Is there SW prefetch on aarch64? If not, let's delete the macro.
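(Aside: AArch64 does provide software prefetch; the PRFM instruction is exposed by GCC and Clang through __builtin_prefetch. A minimal sketch of such a helper, purely as an illustration and not code from this PR, is below.)

```cpp
#include <cstddef>

// Illustration only: cache-line-granular software prefetch that lowers to
// PRFM on AArch64 and to PREFETCHT0 on x86-64 via __builtin_prefetch.
inline void prefetch_bytes(const void* src, std::size_t bytes) {
    const char* p = static_cast<const char*>(src);
    for (std::size_t i = 0; i < bytes; i += 64) {
        __builtin_prefetch(p + i, /*rw=*/0, /*locality=*/3);
    }
}
```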
Since prefetch is not used in PA, I have removed it.
Since the functions inside relate to PA specifically, I would propose renaming it to pa_kernels.hpp.
Sure, renamed to "pa_kernels.hpp"
Force-pushed from 3bcc293 to 0ccdde6.

build_jenkins

Force-pushed from 0ccdde6 to fa78a3a.
@ashwins990 Could you please rebase your PR on the latest master to incorporate changes from #27273?
### Details:
- Fix: do not infer a non-executable node if it is in a constant path
- Fix: redefine the 2nd output memory of the `PagedAttention` node if `m_hasScore` is `false`
- These 2 fixes are required to support fp16 PagedAttention: #27841

### Tickets:
- *ticket-id*
This development is related to Feature Request #26422.

Benchmarking Results
Machine: Graviton 3 - 64 cores
vLLM serving benchmark on ShareGPT dataset
vLLM Throughput benchmark on ShareGPT dataset

vLLM with OpenVINO backend
Clone the vLLM repo.
Set the inference precision to f32 before model compilation by setting Execution Mode to ACCURACY.
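The exact snippet is not reproduced in this description; a minimal sketch using the OpenVINO C++ API is shown below (the model path is a placeholder, and the vLLM OpenVINO backend would set the equivalent property from its Python code before compiling the model).

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    auto model = core.read_model("model.xml");  // placeholder path

    // ACCURACY mode disables the default f16 inference-precision downcast
    // on ARM, so inference runs in f32 as the note below requires.
    auto compiled = core.compile_model(
        model,
        "CPU",
        ov::hint::execution_mode(ov::hint::ExecutionMode::ACCURACY));
    return 0;
}
```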
Note: If we don't set the inference precision to f32, it will take the f16 precision path. This can lead to a segmentation fault on AArch64 due to the presence of an optional variable (the Alibi parameter) in the transformation graph. Optional variables are graph nodes with empty shapes.
After the above change, follow this link and install vLLM from source.