Arm AArch64: optimized GEMV and GEMM kernels for q4_0_q8_0, and q8_0_q8_0 quantization #5780
Conversation
Hi, thanks for sharing, I was excited about this, but it doesn't seem to do what I expected for my device. Perhaps I'm not building correctly? Here's my command:
(master) test with gemma 2b 8_0:
(PR) test:
It's much slower. Maybe it only works with a certain build? It's possible I'm doing something wrong, but I don't see it. |
ggml.c
Outdated
@@ -1,3 +1,4 @@
// SPDX-FileCopyrightText: Copyright 2024 Arm Ltd.
llama.cpp's copyright owner is ARM now?
@cebtenzzre Hi, these copyright notices clarify the copyright situation for the code contributed in this PR. Is there a more appropriate place in this project that they should be moved to? Please suggest
Disclaimer up front: I am not a lawyer, and this is not legal advice.
Joint authorship can be messy, and I think for legal reasons it is simplest for an open-source project such as llama.cpp to have a single copyright holder (see e.g. the GPL FAQ). That would be Georgi Gerganov. The SYCL backend is currently copyright Intel, but that's not too big a deal because it's in a separate file that is not compiled by default.
I think it would be best if you could either assign the copyright of your changes to Georgi Gerganov, or to disclaim copyright entirely and put them in the public domain. The MIT license is very permissive, so you don't really lose anything by assigning copyright. If you'd like to clarify the attribution of your changes, I think something like this could work:
// SPDX-FileCopyrightText: Copyright 2024 Georgi Gerganov
// SPDX-FileContributor: Arm Ltd.
@cebtenzzre Hi, we attempted to address your concern about copyright in the most recent changes that we upstreamed last week. Please see our most recent comment from May 1, which describes our changeset. Could you please review the changeset and let us know if that is okay? Thanks
@Jeximo Hi, could you please try using -march=armv8.2-a? You may need to add +i8mm to define the MMLA intrinsics.
@Dibakar Thanks for the tips.
I'm inexperienced so if you're clear then I'm open to suggestions. All good either way. |
Maybe all the data rearrangement that this change requires could be implemented in a less intrusive way as a different backend. It is not really possible at the moment, supporting backends that only implement matrix multiplication would require changes to |
@Jeximo what is the prompt (prompt eval time) and text generation (eval time) token/s rate now for armv8.2-a with and without the patch? |
@Dibakar To be as clear as possible, I build for CPU only: not OpenBLAS, not GPU, not Vulkan. Here's the build info:
Master with
PR with
command: For my device,
Admittedly, eval time in |
As of writing, current head:
PR branch:
Tests done on my OnePlus 12 (Android) under Termux. Things I have noticed:
I ran all tests with ./main -m ../westlake-7b-v2.Q4_0.gguf -t 4 -p "why is the sky blue?" --no-mmap -n 128
These changes are too intrusive - there are a lot of new quantization structs introduced to the core library which would be difficult to maintain. The approach proposed earlier in the discussion - implementing a dedicated backend and moving the entire matrix multiplication code there - should be considered instead. Also, one additional thing to consider is that
@ggerganov @slaren Sure, we should be able to place things in the extra field of ggml_tensor.
@ggerganov @slaren Hi, as I mentioned in my previous response, the code is made up of two main parts and associated structures. One is weight rearrangement, and the other is actual optimized kernels. Please suggest if there is a specific backend file/location where you want our weight rearrangement code to be moved. The optimized matrix multiplication kernels ggml_gemv_q4_0_q8_0 and ggml_gemm_q4_0_q8_0 are currently defined in ggml-quants.h/c, where the other vec_dot/gemm kernels for other variants are defined. Should we keep them there? Please suggest. |
@Dibakar Ideally, this would be implemented as a different backend that implements the ggml-backend interface. This would allow us to isolate all this code to a different file, without modifying the internals of ggml, and without requiring applications to call additional functions to convert the tensors. However, it is not possible to do this at this moment due to the limitations that I mentioned earlier - we need to modify |
@slaren Thanks for the feedback. I would like to clarify that the weight rearrangement step can be done completely offline when converting an HF file to a GGUF file. We can incorporate it into the current convert-hf-to-gguf.py for the Arm aarch64 optimized kernels, and make provisions for generating another GGUF in addition to the current GGUF file and using it for inference. This will remove all of the changes we made to ggml.c/.h and llama.cpp. The rearrangement only needs to happen once before inference begins, which is why we placed the code in ggml.c/.h and llama.cpp; however, that code can be completely removed. After that, we will be left with only our ggml_gemv_q4_0_q8_0 and ggml_gemm_q4_0_q8_0 kernels and the quantize_row_q4_0 functions that we introduced in the ggml-quants.c file, where the vec_dot/gemm kernels for the other quantization methods are defined. Our quantize_row_q4_0 changes can easily be refactored into the existing quantize_q4_0 function. In addition, we need to keep a few lines of code in ggml.c's forward_mul_mat function to call our optimized kernels whenever possible; we can refactor the code around the current vec_dot call in forward_mul_mat, or make it part of the vec_dot call, to introduce this in a non-intrusive manner. If we make these changes, I was wondering if we still need a separate backend for them. We can submit the changes for that in the coming days, making the overall changes significantly less intrusive than they appear.
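To make the rearrangement idea concrete, here is a minimal, self-contained sketch of interleaving the blocks of four consecutive weight rows so a 4-row kernel can stream them from one contiguous pointer. The layout, the locally redefined block_q4_0, and the function name interleave_rows_x4 are illustrative assumptions, not the actual code from this PR.

```c
// Illustrative only: regroup Q4_0 blocks so that the blocks of four
// consecutive weight rows sit next to each other in memory.
#include <stdint.h>

#define QK4_0 32

typedef uint16_t ggml_fp16_t;            // fp16 stored as raw bits

typedef struct {
    ggml_fp16_t d;                       // per-block scale
    uint8_t     qs[QK4_0 / 2];           // 32 x 4-bit quants, two per byte
} block_q4_0;

// rows: nrow x nblk blocks, row-major (standard Q4_0 layout), nrow % 4 == 0
// out:  same total size, regrouped per group of 4 rows as
//       [r0 b0][r1 b0][r2 b0][r3 b0][r0 b1][r1 b1]...
void interleave_rows_x4(const block_q4_0 * rows, block_q4_0 * out,
                        int nrow, int nblk) {
    for (int r0 = 0; r0 < nrow; r0 += 4) {
        for (int b = 0; b < nblk; b++) {
            for (int i = 0; i < 4; i++) {
                out[(r0 / 4) * nblk * 4 + b * 4 + i] = rows[(r0 + i) * nblk + b];
            }
        }
    }
}
```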
I think that could also work: you could add new data types for the rearranged Q4_0 and Q8_0, add the ability to quantize models to these data types, and add all the relevant functions for these data types in type_traits.
Yes, rearranged data types should work. This way the logic in Try to remove the changes in |
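As a rough illustration of the type_traits approach being suggested, here is a hedged, self-contained analogy: each type registers its block geometry and its quantize/dot kernels in one table, so a rearranged Q4_0 becomes just another entry whose quantize function performs the rearrangement. All identifiers here (demo_type, demo_type_traits, the stub kernels) are made up for the sketch; ggml's real struct and field names may differ.

```c
#include <stddef.h>
#include <stdint.h>

typedef void (*quantize_row_fn)(const float * x, void * y, int64_t k);
typedef void (*vec_dot_fn)(int n, float * s, const void * x, const void * y);

// Stub kernels standing in for the real quantize_row_* / vec_dot kernels.
static void quantize_row_stub(const float * x, void * y, int64_t k) { (void)x; (void)y; (void)k; }
static void vec_dot_stub(int n, float * s, const void * x, const void * y) { (void)n; (void)s; (void)x; (void)y; }

enum demo_type {
    DEMO_TYPE_Q4_0,
    DEMO_TYPE_Q8_0,
    DEMO_TYPE_Q4_0_REARRANGED, // hypothetical rearranged-weights type
    DEMO_TYPE_COUNT,
};

struct demo_type_traits {
    const char *    type_name;
    int             blck_size;    // elements per block
    size_t          type_size;    // bytes per block
    quantize_row_fn from_float;   // float row -> quantized blocks
    vec_dot_fn      vec_dot;      // kernel used by mul_mat for this type
    enum demo_type  vec_dot_type; // type the activations get quantized to
};

static const struct demo_type_traits demo_traits[DEMO_TYPE_COUNT] = {
    [DEMO_TYPE_Q4_0] = {
        .type_name = "q4_0", .blck_size = 32, .type_size = 18, // 2-byte scale + 16 nibble bytes
        .from_float = quantize_row_stub,
        .vec_dot = vec_dot_stub, .vec_dot_type = DEMO_TYPE_Q8_0,
    },
    [DEMO_TYPE_Q8_0] = {
        .type_name = "q8_0", .blck_size = 32, .type_size = 34, // 2-byte scale + 32 int8 bytes
        .from_float = quantize_row_stub,
        .vec_dot = vec_dot_stub, .vec_dot_type = DEMO_TYPE_Q8_0,
    },
    [DEMO_TYPE_Q4_0_REARRANGED] = {
        // Same storage size as q4_0, but blocks interleaved for the AArch64
        // GEMV/GEMM kernels; its from_float does the rearrangement while quantizing.
        .type_name = "q4_0_rearranged", .blck_size = 32, .type_size = 18,
        .from_float = quantize_row_stub,
        .vec_dot = vec_dot_stub, .vec_dot_type = DEMO_TYPE_Q8_0,
    },
};
```

With such a table, the generic matrix-multiplication code can simply dispatch through the entry's function pointers, which is what makes a new type relatively non-intrusive.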
I have tested the PR on AWS Graviton3 (r7g.16xl instance). For q4_0 quantized inference, it improved prompt evaluation performance by around 1.75x for the 64-thread configuration and 2.5x for the single-thread configuration. The weights reordering/prepacking latency is the only concern I see; once it's addressed with the new quantization format, the changes should be fine. Command:
64 vcpus results: mainline
PR:
single thread results: mainline:
PR:
A generalized capability to re-process the weights on model load would be quite useful to a lot of people, I think. This patch does it for a narrow purpose, but considering how cheap/fast your quantization is, why not have "online quantization" (which could then handle this narrow case too)? After all, HF transformers can do that. It would make the logistics of evaluating a bunch of different flavors much simpler: no need to produce and post a ton of GGUF files that will mostly be used once (per person evaluating them) and just clutter everyone's storage. It would be enough to keep the fp16 flavor around, which is already being released for some newer models (e.g., Gemma). If that does not put TheBloke out of business, it would at least lighten the load. It would also significantly reduce the problem of obsolete GGUF files spread everywhere when something changes in the code.
FYI, @Dibakar here's this PR (your branch + PR cherry-pick) on one of the latest snapdragons.
Sorry, that was a build problem. The matmul int8 got enabled in ggml (where I did check it) but not in llama (where I did not). So the new functions were getting called with NULL for the "rearranged weights" buffer. While debugging that, I did notice that all the "rearrange" functions just call malloc() unchecked and are declared void, which would lead to some lovely crashes once you run out of memory. And that will happen, because it doesn't look like the rearranged buffers are ever freed. I ran llama-bench with multiple models, and that did happen. Of course, if this all goes into some offline "new quantization types", it is all moot. So, now to the good part - the performance. It is pretty amazing what this PR does. First, some baseline numbers for the current master branch:
And now, Dibakar's branch with this PR, on the same device.
Can this be extended to support Q4_1? |
@akingoverlook Yes, it can easily be extended to support Q4_1 (it has the additional fp16 m field). |
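For context, the two block layouts differ only by that extra field. Roughly (the typedef and constants here approximate the ggml definitions of the time and are redefined locally for the sketch):

```c
#include <stdint.h>

typedef uint16_t ggml_fp16_t;   // fp16 stored as raw bits

#define QK4_0 32
#define QK4_1 32

// Q4_0: symmetric 4-bit quantization, per-block scale only.
typedef struct {
    ggml_fp16_t d;              // scale
    uint8_t     qs[QK4_0 / 2];  // nibbles, two 4-bit values per byte
} block_q4_0;

// Q4_1: asymmetric 4-bit quantization, per-block scale and minimum.
typedef struct {
    ggml_fp16_t d;              // scale
    ggml_fp16_t m;              // minimum ("the additional fp16 m field")
    uint8_t     qs[QK4_1 / 2];  // nibbles
} block_q4_1;
```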
Hello, this is really fantastic work! I did get faster inference speed when trying this PR with online rearrangement, but at the same time the memory overhead also increased. I think this is the additional memory caused by rearrangement, so I want to try the offline rearrangement approach. However, I may not be able to modify the code of convert-hf-to-gguf.py myself. Could you provide some example code? Thank you!
@ggerganov @slaren @cebtenzzre Thank you all for your suggestions. We attempted to address all of them in the current changes.
We defined a new quantization data type, Q4_0_AARCH64, as suggested by the llama.cpp reviewers, added the ability to quantize models to this data type, and added all the relevant functions for it in type_traits. We removed the changes from the main files (llama.cpp, ggml.c) and re-implemented the rearrange_ functions as quantization functions for the new type. We added two new files, ggml-aarch64.cpp and ggml-aarch64.h, to hold our Arm-optimized kernels, and made only minor changes to the main files (ggml.c, llama.cpp) that interact with those kernels.
We have added the copyright claim only to the ggml-aarch64.cpp and ggml-aarch64.h files where the optimized kernels live, and removed it from the remaining main llama.cpp files, where we made minor changes simply to interact with the kernels. Note that Arm is not assigning copyright in these contributions.
We rearrange the weights offline while the model is quantized to Q4_0. As a result, when we run inference later, the weight loading time is identical to loading a Q4_0 model file. Our changes automatically detect the underlying hardware platform (for example, Graviton2/3/4, and so on) and generate the appropriate aarch64-compliant weight format, which is then used for inference. Compared to Q4_0, the Q4_0_AARCH64 GGUF format does not add any memory overhead; this is also addressed in the changeset.
We have further optimized our kernels for different Arm CPU architectures, which resulted in higher performance than we showed last time. Please see below for the updated performance numbers. We also ran the perplexity test to confirm that perplexity remains consistent with the original Q4_0 model, as expected.
We have included provisions for converting both an fp16 HF model to q4_0_aarch64 and an existing q4_0 model from HF to q4_0_aarch64. Please see below for the required command lines to convert a model to the Q4_0_AARCH64 GGUF format before using it for inference.
Single inference:
Batched inference:
@ggerganov @slaren @cebtenzzre Hi, we attempted to address your suggestions for this PR in the most recent changes, which we upstreamed last week. Please see our most recent comment from May 1, which describes our changeset. Could you please review the changeset and let us know your feedback? Thanks |
I'm failing to build the latest optimizations.
cmake PR build log:
cmake -B build -DCMAKE_C_FLAGS=-march=armv8.4-a+dotprod+i8mm && cd build && cmake --build . --config Release --target server --target main --target llama-bench --target quantize && cd bin/
-- The C compiler identification is Clang 18.1.4
[ 5%] Building CXX object common/CMakeFiles/build_info.dir/build-info.cpp.o
Maybe my device is now incompatible..? The same instruction works on master:
master cmake:
cmake -B build -DCMAKE_C_FLAGS=-march=armv8.4-a+dotprod+i8mm && cd build && cmake --build . --config Release --target server --target main --target llama-bench --target quantize && cd bin/
-- The C compiler identification is Clang 18.1.4
[ 6%] Generating build details from Git
C_FLAGS aren't passed to the C++ compiler; you need to set similar CXX_FLAGS to enable dotprod+i8mm.
ggml-aarch64.cpp
Outdated
The performance improvements are lovely, but the logistical aspect of this can be improved.
It should be easy enough to detect the availability of SVE/NEON/I8MM at runtime - no need to tell people working for Arm how (a sketch of one way to do it follows this comment). That logic should be placed into ggml_cpu_has_neon() and ggml_cpu_has_matmul_int8(), which right now just depend on the same __ARM_NEON/__ARM_FEATURE_MATMUL_INT8 flags that are set at compile time.
Since most people will be cross-compiling for ARM, at least half won't set the correct flags and will never get to see the benefits. Those that do set them, will face the problem with targeting multiple device types (as people targeting ARM often have to) and needing to produce, package, and deploy multiple/matching libraries/binaries.
Removing that headache should be worth the trouble of doing the dynamic detection.
Also, the new _AARCH64 quantization types would depend on what compile flags were set, and then the quantized models would not be interchangeable between ARM targets built with a different set of flags. I would need a different quantized model for each of my ARM devices, yet they would all have the same type. That is unmanageable in any practical sense. The quantization type needs to convey compatibility with the inference HW, so you would need one per ISA combination.
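A sketch of the kind of runtime detection mentioned above, for AArch64 Linux/Android via the hwcaps (assuming reasonably recent kernel headers; other OSes need their own mechanism, and this is not code from the PR):

```c
// Runtime CPU feature detection sketch for AArch64 Linux/Android.
#include <stdbool.h>
#include <stdio.h>

#if defined(__aarch64__) && defined(__linux__)
#include <sys/auxv.h>
#include <asm/hwcap.h>

static bool cpu_has_dotprod(void) { return (getauxval(AT_HWCAP)  & HWCAP_ASIMDDP) != 0; }
static bool cpu_has_sve(void)     { return (getauxval(AT_HWCAP)  & HWCAP_SVE)     != 0; }
static bool cpu_has_i8mm(void)    { return (getauxval(AT_HWCAP2) & HWCAP2_I8MM)   != 0; }
#else
static bool cpu_has_dotprod(void) { return false; }
static bool cpu_has_sve(void)     { return false; }
static bool cpu_has_i8mm(void)    { return false; }
#endif

int main(void) {
    printf("dotprod: %d  sve: %d  i8mm: %d\n",
           cpu_has_dotprod(), cpu_has_sve(), cpu_has_i8mm());
    return 0;
}
```

Functions like ggml_cpu_has_matmul_int8() could then return this runtime result instead of a compile-time constant.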
Change the ISA detection method to runtime detection.
The 32-bit Armv7 build is failing: Probably have to check Also, take a look at this comment: #5780 (comment) |
We included |
…s for building the Q4_0_4_4 quant type
@Dibakar Thank you for the contribution - it's now merged. It would be great to add AArch64-optimized kernels for Q8 models in the future - contributions are welcome
@ggerganov Thank you so much for reviewing our changes and merging our PR into the mainline. Yes, we will discuss internally about contributing Q8 code. |
* Arm AArch64: optimized GEMV and GEMM kernels for q4_0_q8_0, and q8_0_q8_0 quantization
* Arm AArch64: add optimized GEMV and GEMM asm kernels for q4_0_q8_0 quantization and refactor code to address llama.cpp pr#5780 suggestions
* Arm AArch64: add optimized GEMV and GEMM asm kernels for q4_0_q8_0 quantization and refactor code to address llama.cpp pr#5780 suggestions
* Arm AArch64: add optimized GEMV and GEMM asm kernels for q4_0_q8_0 quantization and refactor code to address llama.cpp pr#5780 suggestions
* Arm AArch64: add optimized GEMV and GEMM asm kernels for q4_0_q8_0 quantization and refactor code to address llama.cpp pr#5780 suggestions
* Arm AArch64: add copyright claim only to ggml-aarch64.cpp and ggml-aarch64.h files
* Arm AArch64: minor code refactoring for rebase
* Arm AArch64: minor code refactoring for resolving a build issue with cmake
* Arm AArch64: minor code refactoring to split the Q4_0_AARCH64 type into three separate types: Q4_0_4_4, Q4_0_4_8, and Q4_0_8_8
* Arm AArch64: minor code change for resolving a build issue with server-windows
* retrigger checks
* Arm AArch64: minor code changes for rebase
* Arm AArch64: minor changes to skip the pr#7433 vec_dot code for arm cpus with SVE VL not equal to 256 bits
* Arm AArch64: remove stale LLAMA_QKK_64 from CMakeLists.txt and delete build.zig
* Arm AArch64: add reference scalar gemm and gemv, and avoid dynamic memory allocations during quantization for Q4_0_4_4, Q4_0_4_8, and Q4_0_8_8
* Arm AArch64: add multithreaded quantization support for the new types: Q4_0_4_4, Q4_0_4_8, and Q4_0_8_8
* Arm AArch64: minor code refactoring
* Arm AArch64: simplify logic for calling gemm and gemv functions in ggml_compute_forward_mul_mat
* Arm AArch64: minimize changes in ggml_compute_forward_mul_mat
* Arm AArch64: minor code refactoring, and add reference scalar code to quantize routines for new quant types
* Arm AArch64: minor code refactoring
* Arm AArch64: minor code refactoring
* Arm AArch64: minor code refactoring
* rebase on the latest master commit 3fd62a6 and adapt to the new directory structure
* Arm AArch64: remove a redundant comment
* Arm AArch64: add pragma in ggml-aarch64.c to turn -Woverlength-strings warning off
* Arm AArch64: use __aarch64__ check to guard 64-bit neon kernels
* Arm AArch64: update docs/build.md README to include compile time flags for building the Q4_0_4_4 quant type
FYI, I got the Q4_0_4_4 variant to work for me on M2 Macs, Snapdragon X / Windows 11 24H2 / clang, and on Snapdragon X / Windows 11 24H2 / WSL2-Ubuntu24.04 / gcc. On the Mac, I had to use -march=armv8.5-a and -D LLAMA_NO_ACCELERATE=1, otherwise it tends to segfault. The speed-improvement results are very impressive. M2 MacBook Air (llama.cpp version: 3387 (fa79495) built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.5.0):
Snapdragon X Plus / Windows 11 24H2 / clang (llama.cpp version: 3388 (e236528) built with (clang) for aarch64-pc-windows-msvc):
Snapdragon X Plus / Windows 11 24H2 / WSL2-Ubuntu24.04 / gcc (llama.cpp version: 3388 (e236528) built with cc (Ubuntu 13.2.0-23ubuntu4) 13.2.0 for aarch64-linux-gnu):
P.S: the new aarch64-optimized GEMV and GEMM kernels for Q4_0_4_4 in ggml-aarch64.c do not work with MSVC (MSVC has no _asm_ on any 64-bit architecture) - please use clang on Windows for ARM. |
FYI, another potential use case for the Q4_0_4_4 optimizations is ARM virtual machines, where there is no GPU virtualization - e.g. with Apple silicon VMs and containers. Virtualization on macOS seems much slower than on Windows, where there is nearly no CPU overhead for WSL2. I was able to get good acceleration with Q4_0_4_4 in M2 VMs (but starting from a bad virtualization-penalty baseline): virtual 4-cpu Windows 11 24H2 in Parallels (clang compiler):
virtual 4-cpu Ubuntu24.04 in Parallels:
Note:
- scikit-build -> scikit-build-core
- Remove -DGGML_BLAS=ON, OpenBLAS cmake tags from build
- Needed (see ggerganov/llama.cpp#5780 (review)) to get this to work properly
- Built, deployed, and tested with llama3.1 with q4_0_4_4 quantization
For the Arm AArch64 architecture, this PR adds support for optimized GEMV (using dot instructions) and GEMM (using mmla instructions) kernels for the q4_0_q8_0 and q8_0_q8_0 quantization methods.
The feature is enabled if the platform supports __ARM_FEATURE_MATMUL_INT8 (for GEMM) and __ARM_NEON or __ARM_FEATURE_SVE (for GEMV).
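As a rough illustration of the building blocks involved (not the PR's actual kernels), the dot-product and int8 matrix-multiply intrinsics can be exercised like this when the corresponding feature macros are defined, e.g. by compiling with -march=armv8.2-a+dotprod+i8mm:

```c
#include <stdio.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

int main(void) {
    int8_t a[16], b[16];
    for (int i = 0; i < 16; i++) { a[i] = (int8_t)(i + 1); b[i] = 2; }

#if defined(__ARM_FEATURE_DOTPROD)
    // vdotq_s32: four independent 4-element int8 dot products per call
    // (the building block of the GEMV path).
    int32x4_t acc = vdupq_n_s32(0);
    acc = vdotq_s32(acc, vld1q_s8(a), vld1q_s8(b));
    printf("dotprod lanes: %d %d %d %d\n",
           vgetq_lane_s32(acc, 0), vgetq_lane_s32(acc, 1),
           vgetq_lane_s32(acc, 2), vgetq_lane_s32(acc, 3));
#endif

#if defined(__ARM_FEATURE_MATMUL_INT8)
    // vmmlaq_s32 (SMMLA): treats each int8x16_t as a 2x8 matrix and
    // accumulates the 2x2 int32 product a * b^T (the GEMM building block).
    int32x4_t tile = vdupq_n_s32(0);
    tile = vmmlaq_s32(tile, vld1q_s8(a), vld1q_s8(b));
    printf("mmla 2x2 tile: %d %d %d %d\n",
           vgetq_lane_s32(tile, 0), vgetq_lane_s32(tile, 1),
           vgetq_lane_s32(tile, 2), vgetq_lane_s32(tile, 3));
#endif

    (void)a; (void)b;
    return 0;
}
```

Each vmmlaq_s32 call accumulates a full 2x2 tile of int32 dot products, which is why the GEMM path benefits when several activation rows are processed together.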
On AWS Graviton3 processors, these kernels resulted in a 2.5x improvement in prompt evaluation over the existing GEMM mmla kernels, as well as a 2x improvement in text generation over the default vec_dot kernel (Feb 21 commit 973053d). Please see the table below.
Authors: David Mansell ([email protected]) and Dibakar Gope ([email protected])