Aarch64 paged attention enablement #27841
base: master
Conversation
Force-pushed from d886f40 to 3bcc293; build_jenkins triggered.
Hi @alvoron, thanks for looking into this. Yes, this PR has the latest changes. To make it work, I made the modification below to file_path:
Add the first two lines before ov_compiled in link. The vLLM command is correct. Could you please confirm that you have made the above change to vLLM?
@ashwins990 Some CI jobs have failed (e.g. https://github.com/openvinotoolkit/openvino/actions/runs/12158176037/job/33968356475?pr=27841). Could you please take a look?
@ashwins990 I implemented a couple of fixes to support f16. Could you please cherry-pick these changes and try fp16 inference again?
bool is_bf16 = inType == ov::element::bf16;
M_blk = matmulOptimalM;
M_tail = M % M_blk;
brgVnniFactor = 4 / inType.size();
VNNI naming is not applicable for ARM, I would say. Can we use something like "kBlkStep"?
Changed to "kBlkStep". Thanks.
b_transposed(b_transposed),
inType(inType) {
    // blocking M
    bool is_bf16 = inType == ov::element::bf16;
Do you have plans to enable it for fp16 as well? The default inference precision on ARM is fp16, so to get better performance we will need to extend the code with fp16 support.
BF16 is not used by OV on ARM as of now.
Currently, the Brgemm kernel does not support f16. We are planning to support an int8 brgemm kernel, and once it's available I can integrate it here. For now, should I remove this check?
The f16 path can go through the ACL GEMM, I guess. I am checking the same. Thanks.
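A minimal sketch of the precision-based dispatch discussed here; the enum and function name are hypothetical, not the PR's API:

```cpp
#include "openvino/core/type/element_type.hpp"

enum class GemmBackend { Brgemm, AclGemm };

// Route f16 through an ACL-backed GEMM path until the brgemm kernel gains
// f16 support; keep brgemm for the precisions it already handles.
inline GemmBackend pick_gemm_backend(const ov::element::Type& inType) {
    if (inType == ov::element::f16) {
        return GemmBackend::AclGemm;
    }
    return GemmBackend::Brgemm;
}
```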
const size_t get_mblk_size() const {
    return matmulOptimalM;
}
const size_t get_k_blk() const {
    return K_blk;
}
const size_t get_wsp_size() const {
    return 4 * 1024;
}
Unused. Please remove
get_wsp_size() is used in the code. I have removed the other two. Thanks.
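For context, a small sketch of how a per-thread workspace sized by get_wsp_size() could be used; the helper below is hypothetical, not the PR's code:

```cpp
#include <cstdint>
#include <vector>

// One scratch buffer per worker thread, sized by the kernel's workspace
// requirement (4 KiB in this PR, matching get_wsp_size() above).
std::vector<std::vector<uint8_t>> make_scratch(size_t num_threads, size_t wsp_size) {
    return std::vector<std::vector<uint8_t>>(num_threads, std::vector<uint8_t>(wsp_size));
}

// Usage (kernel object name is hypothetical):
//   auto scratch = make_scratch(nthr, kernel.get_wsp_size());
```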
size_t M = 0, N = 0, K = 0, LDA = 0, LDB = 0, LDC = 0;
dnnl_data_type_t dt_in0 = dnnl_data_type_undef;
dnnl_data_type_t dt_in1 = dnnl_data_type_undef;
char palette[64];
palette is x64 AMX-specific. It can be removed.
Removed. Thanks
dnnl_data_type_t dt_in0 = dnnl_data_type_undef;
dnnl_data_type_t dt_in1 = dnnl_data_type_undef;
char palette[64];
bool is_with_comp = false;
Not applicable for ARM. Please remove
Removed
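A sketch of the trimmed per-call configuration after dropping the x64-only fields; the struct name and exact field set mirror the snippets quoted above, not the PR's actual declaration:

```cpp
#include <cstddef>
#include "oneapi/dnnl/dnnl_types.h"

struct BrgemmConfig {
    size_t M = 0, N = 0, K = 0, LDA = 0, LDB = 0, LDC = 0;
    dnnl_data_type_t dt_in0 = dnnl_data_type_undef;
    dnnl_data_type_t dt_in1 = dnnl_data_type_undef;
    // char palette[64];    // removed: AMX tile palette, x64 only
    // bool is_with_comp;   // removed: compensation not applicable on ARM
};
```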
namespace Cpu {
namespace XARCH {

#define prefetch_bytes(bytes, sel, advance, src)
Is there SW prefetch on aarch64? If not, let's delete the macro.
Since prefetch is not used in pa, I have removed it.
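As a side note on the question above: aarch64 does expose software prefetch (the PRFM instruction), reachable portably through the compiler builtin. This is purely illustrative; the PR removed the macro because paged attention does not use prefetching:

```cpp
// Hints the core to pull the cache line at p for a future read.
// GCC/Clang lower this to PRFM PLDL1KEEP on aarch64.
inline void prefetch_for_read(const void* p) {
    __builtin_prefetch(p, 0 /* read */, 3 /* keep in all cache levels */);
}
```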
Since the functions inside relate specifically to PA, I would propose renaming it to pa_kernels.hpp.
Sure, renamed to "pa_kernels.hpp"
Force-pushed from 3bcc293 to 0ccdde6; build_jenkins triggered.
Force-pushed from 0ccdde6 to fa78a3a.
This development is related to Feature Request #26422.
Benchmarking Results
Machine: Graviton 3 - 64 cores
vLLM serving benchmark on ShareGPT dataset
vLLM Throughput benchmark on ShareGPT dataset
vLLM with openvino backend
Clone the vLLM repo
Set the inference precision to f32 before model compilation by setting the Execution Mode to ACCURACY (see the sketch after these steps).
Note: If we don't set the inference precision to f32, it will take the f16 precision path. This can lead to a segmentation fault on Aarch64 due to the presence of an optional variable (the Alibi parameter) in the transformation graph. Optional variables are graph nodes with empty shapes.
After the above change, follow this link and install vLLM from source.
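A minimal sketch of the precision/accuracy settings described above, using the standard OpenVINO C++ API. The PR's actual change is applied inside the vLLM OpenVINO backend before the model is compiled; this is an equivalent illustration with a placeholder model path, not the exact patch:

```cpp
#include "openvino/openvino.hpp"

int main() {
    ov::Core core;
    auto model = core.read_model("model.xml");  // placeholder path

    // Force f32 inference precision and ACCURACY execution mode so the f16
    // path (which can hit the Aarch64 segfault described in the note above)
    // is not taken.
    auto compiled = core.compile_model(
        model, "CPU",
        ov::hint::inference_precision(ov::element::f32),
        ov::hint::execution_mode(ov::hint::ExecutionMode::ACCURACY));
    return 0;
}
```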