
Aarch64 paged attention enablement #27841

Open · wants to merge 4 commits into base: master

Conversation

@ashwins990 commented Nov 30, 2024

This development is related to feature request #26422.

Benchmarking Results

Machine: Graviton 3 (64 cores)

vLLM serving benchmark on ShareGPT dataset

| Requests/sec | Avg TTFT (sec) | Avg TPOT (sec) | Output / Total Throughput (tokens/sec) |
|---|---|---|---|
| 0.2 | 1.153 | 0.186 | 38.73 / 77.79 |
| 0.5 | 2.083 | 0.482 | 94.10 / 187.92 |

vLLM Throughput benchmark on ShareGPT dataset

[Figure: plot_aarch64_sve — throughput benchmark plot]

vLLM with openvino backend

1. Clone the vLLM repo.
2. Set the inference precision to f32 before model compilation by setting the execution mode hint to ACCURACY:

```python
# file_path: vllm/model_executor/model_loader/openvino.py
import openvino.properties.hint as hints

# Force f32 inference precision by requesting ACCURACY execution mode
ov_core.set_property(
    "CPU",
    {hints.execution_mode: hints.ExecutionMode.ACCURACY},
)
ov_compiled = ov_core.compile_model(pt_model.model, ov_device)
self.ov_request = ov_compiled.create_infer_request()
```

Note: if the inference precision is not set to f32, execution takes the f16 path. On AArch64 this can lead to a segmentation fault due to the optional Alibi parameter in the transformation graph; optional variables are graph nodes with empty shapes.

After the above change, follow this link and install vLLM from source.
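For reference, the same hint can be exercised outside vLLM with a minimal standalone script; this is a sketch, and the model path is a placeholder:

```python
# Minimal standalone sketch of the ACCURACY hint (not part of the PR).
# "model.xml" is a placeholder; use any OpenVINO IR model.
import openvino as ov
import openvino.properties.hint as hints

core = ov.Core()
# Request ACCURACY mode so the CPU plugin keeps f32 inference precision
core.set_property("CPU", {hints.execution_mode: hints.ExecutionMode.ACCURACY})
model = core.read_model("model.xml")
compiled = core.compile_model(model, "CPU")
request = compiled.create_infer_request()
```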

@ashwins990 ashwins990 requested review from a team as code owners November 30, 2024 14:31
@github-actions github-actions bot added category: inference OpenVINO Runtime library - Inference category: CPU OpenVINO CPU plugin category: build OpenVINO cmake script / infra labels Nov 30, 2024
@sys-openvino-ci sys-openvino-ci added the ExternalPR External contributor label Nov 30, 2024
@ashwins990 force-pushed the aarch64-paged-attention-enablement branch from d886f40 to 3bcc293 on December 4, 2024 10:49
@dmitry-gorokhov (Contributor) commented:

build_jenkins

@dmitry-gorokhov dmitry-gorokhov self-assigned this Dec 5, 2024
@ashwins990 (Author) commented:

Hi @alvoron, thanks for looking into this. Yes, this PR has the latest changes.

To make it work, I made the following modification to vLLM before installation:

```python
# file_path: vllm/model_executor/model_loader/openvino.py
import openvino.properties.hint as hints

ov_core.set_property(
    "CPU",
    {hints.execution_mode: hints.ExecutionMode.ACCURACY},
)
ov_compiled = ov_core.compile_model(pt_model.model, ov_device)
self.ov_request = ov_compiled.create_infer_request()
```

Add the first two lines before the `ov_compiled` call (see link). By setting the hint to ACCURACY, I was able to run the vLLM benchmark script and others as well.

The vLLM command is correct. Could you please confirm whether you have made the above change to vLLM?

@dmitry-gorokhov (Contributor) commented:

@ashwins990 Some CI jobs have failed (e.g. https://github.com/openvinotoolkit/openvino/actions/runs/12158176037/job/33968356475?pr=27841). Could you please take a look?

@dmitry-gorokhov dmitry-gorokhov added this to the 2025.0 milestone Dec 11, 2024
@dmitry-gorokhov dmitry-gorokhov added the platform: arm OpenVINO on ARM / ARM64 label Dec 11, 2024
@alvoron (Contributor) commented Dec 11, 2024:

@ashwins990 I implemented a couple of fixes to support f16 in PagedAttention: #28017

Could you please cherry-pick these changes and try fp16 inference again?
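One way to do that locally is to fetch the PR head and cherry-pick its commits; a sketch, with the commit SHA left as a placeholder:

```bash
# Fetch the head of PR #28017 into a local branch, then cherry-pick.
# <fix-commit-sha> is a placeholder for the actual commit(s) in that PR.
git fetch https://github.com/openvinotoolkit/openvino.git pull/28017/head:pr-28017
git cherry-pick <fix-commit-sha>
```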

```cpp
bool is_bf16 = inType == ov::element::bf16;
M_blk = matmulOptimalM;
M_tail = M % M_blk;
brgVnniFactor = 4 / inType.size();
```
Contributor: VNNI naming is not applicable for ARM, I would say. Can we use something like "kBlkStep"?

Author: Changed to "kBlkStep". Thanks.
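For context, a toy illustration of what this blocking arithmetic computes; all values here are hypothetical, not taken from the PR:

```cpp
#include <cstddef>
#include <cstdio>

// Toy illustration of the M-blocking and kBlkStep arithmetic above.
// The concrete values (100, 32, 2) are hypothetical.
int main() {
    const std::size_t M = 100;                  // total rows
    const std::size_t M_blk = 32;               // assumed matmulOptimalM
    const std::size_t M_tail = M % M_blk;       // 100 % 32 = 4 tail rows
    const std::size_t elem_size = 2;            // bf16/f16 is 2 bytes
    const std::size_t kBlkStep = 4 / elem_size; // 2 elements per 4-byte group
    std::printf("M_tail=%zu, kBlkStep=%zu\n", M_tail, kBlkStep);
    return 0;
}
```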

```cpp
b_transposed(b_transposed),
inType(inType) {
    // blocking M
    bool is_bf16 = inType == ov::element::bf16;
```
Contributor: Do you have plans to enable this for fp16 as well? The default inference precision on ARM is fp16, so to get better performance we will need to extend the code with fp16 support. BF16 is not used by OV on ARM as of now.

Author: Currently, the Brgemm kernel does not support f16. We are planning to support an int8 Brgemm kernel, and once it is available I can integrate it here. For now, should I remove this check?

The f16 path can go through ACL GEMM, I guess; I am checking the same. Thanks.

Comment on lines 44 to 45:

```cpp
const size_t get_mblk_size() const {
    return matmulOptimalM;
}
const size_t get_k_blk() const {
    return K_blk;
}
const size_t get_wsp_size() const {
    return 4 * 1024;
}
```
Contributor: Unused. Please remove.

Author: get_wsp_size() is used in the code. I have removed the other two. Thanks.

```cpp
size_t M = 0, N = 0, K = 0, LDA = 0, LDB = 0, LDC = 0;
dnnl_data_type_t dt_in0 = dnnl_data_type_undef;
dnnl_data_type_t dt_in1 = dnnl_data_type_undef;
char palette[64];
```
Contributor: palette is x64 AMX-specific. It can be removed.

Author: Removed. Thanks.

```cpp
dnnl_data_type_t dt_in0 = dnnl_data_type_undef;
dnnl_data_type_t dt_in1 = dnnl_data_type_undef;
char palette[64];
bool is_with_comp = false;
```
Contributor: Not applicable for ARM. Please remove.

Author: Removed.

```cpp
namespace Cpu {
namespace XARCH {

#define prefetch_bytes(bytes, sel, advance, src)
```
Contributor: Is there SW prefetch on aarch64? If not, let's delete the macro.

Author: Since prefetch is not used in PagedAttention, I have removed it.
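For reference, AArch64 does expose software prefetch (the PRFM instruction), which GCC and Clang surface through __builtin_prefetch; a minimal sketch, should the macro ever be needed again:

```cpp
// Minimal sketch: software prefetch via the compiler builtin.
// On AArch64 targets, GCC/Clang lower __builtin_prefetch to PRFM.
inline void prefetch_read(const void* src) {
    __builtin_prefetch(src, /*rw=*/0, /*locality=*/3);
}
```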

Contributor: Since the functions inside relate to PA specifically, I would propose renaming the file to pa_kernels.hpp.

Author: Sure, renamed to "pa_kernels.hpp".

@ashwins990 force-pushed the aarch64-paged-attention-enablement branch from 3bcc293 to 0ccdde6 on December 12, 2024 12:12
@dmitry-gorokhov (Contributor) commented:

build_jenkins

@dmitry-gorokhov (Contributor) commented:

@ashwins990 Could you please rebase your PR on the latest master to incorporate changes from #27273?

github-merge-queue bot pushed a commit that referenced this pull request Dec 18, 2024
### Details:
- Fix: do not infer a non-executable node if it is in a constant path
- Fix: redefine the 2nd output memory of the `PagedAttention` node if `m_hasScore` is `false`
- These 2 fixes are required to support fp16 PagedAttention: #27841

### Tickets:
- *ticket-id*