Improve cpu prompt eval speed #6414
Conversation
Please fix the CI builds.
Some very quick tests on my Ryzen 5950X (power limited to 95 W):
A very respectable speedup! Since you did not mention it in the OP, this PR does not touch the handling of NUMA nodes, correct?
Is this not yet set up to support the CPU code used in partial GPU offloading? Will those require custom kernels?
This PR will not speed up CPU+GPU hybrid inference in any meaningful capacity. For large batches you are compute bound and all of the evaluations are done on the GPU. For small batches you are I/O bound and better matrix multiplication algorithms make virtually no difference.
Does this mean it moves layers onto the GPU for large batches instead of processing all GPU layers for the current batch and then doing the remaining layers on CPU? I'm sort of lost; this goes against my current understanding (moving data from CPU to GPU during inference should be slower).
CPU layers have their data in RAM. GPU layers have their data in VRAM. GPU layers are always evaluated on the GPU. The most recent update is this PR: #6083 . For large batch sizes (prompt processing) all data of a CPU layer is moved to the GPU and the calculations are done there in order to make use of the higher GPU compute. For small batch sizes (token generation) CPU layers are evaluated on the CPU. This PR improves the compute efficiency of CPU matrix multiplication. So it only helps in those scenarios where it would not be worthwhile to temporarily move data to VRAM. The improvements in this PR are therefore mutually exclusive with CPU+GPU hybrid inference.
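For readers trying to map that dispatch idea onto code, here is a minimal sketch (my own illustration, not llama.cpp's actual scheduler; the cutoff constant is hypothetical):

/* Sketch only: large batches justify copying a CPU layer's weights to VRAM
 * and running the matmul on the GPU; batch-of-1 token generation does not. */
#include <stdbool.h>

#define OFFLOAD_BATCH_THRESHOLD 32   /* hypothetical cutoff, not from the PR */

static bool should_offload_cpu_layer_to_gpu(int n_batch_tokens) {
    /* prompt processing: compute bound, the PCIe copy amortizes over the batch */
    /* token generation: memory bound, the copy would cost more than it saves   */
    return n_batch_tokens >= OFFLOAD_BATCH_THRESHOLD;
}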
@jart this is pretty awesome; I would add that since a good portion of the contributed code is very generic and could benefit many other downstream projects, it would be even more awesome if that code could live in its own repo; then a subset could be linked or vendored in here.
@phymbert Tests are green. Please take a look.
Thank you very much for the contribution. For the core library and ggml changes, @slaren and @ggerganov will get back to you.
@zougloub Thank you for the encouragement. You can copy sgemm.cpp into your codebase as its own library if you provide an implementation for
@jart Apologies for the slow response - will review the PRs in the following days. Thanks
Thanks @ggerganov, I'm in no rush.
This PR did absolutely nothing for me on Q4_0 and Q8_0, then I realised that it only supports AVX2 and AVX512 for those quants. It does support regular AVX for F16 and F32 though. On my 4c/8t Xeon v2 I get a nice 2x speedup in F16. Just like vanilla llama.cpp, you get the best CPU performance if you use all hyperthreads during prompt processing and switch to one thread per core for inference.
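For reference, that thread split might look like ./main -m model.gguf -p "..." -tb 8 -t 4, assuming llama.cpp's -t/--threads and -tb/--threads-batch options (a sketch, not a tuned recommendation; adjust the counts to your own core/thread topology).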
@netrunnereve in case you're not aware, you can run
Does this PR benefit ARM CPUs?
ggml.c (Outdated)

        return;
    }
UseGgmlGemm1:
    (void)0;
Why not just ; ?
That's to avoid the compiler complaining if the label comes before a variable declaration.
Yes. But using just ; is a simple way to achieve the same goal without introducing dummy expressions. Just do:

UseGgmlGemm1: ;

It has worked perfectly fine since the C99 standard. See https://godbolt.org/z/6sbKhnhW9 .
BTW, this issue with labels is fixed in the C23 standard.
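A standalone illustration of the point (my own example, not code from this PR): before C23 a label must be attached to a statement, not a declaration, so a null statement lets declarations follow immediately.

#include <stdio.h>

static void demo(int skip) {
    if (skip)
        goto UseGgmlGemm1;
    printf("first pass\n");
UseGgmlGemm1: ;              /* null statement: valid label target in C99 */
    int n = 2 + skip;        /* a declaration may now follow the label    */
    printf("n = %d\n", n);
}

int main(void) {
    demo(0);
    demo(1);
    return 0;
}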
Thank you @tstanisl, you taught me something I didn't know. Fixed!
ggml.c (Outdated)

        return;
    }
UseGgmlGemm2:
    (void)0;
ditto
I think so.
Should I get acceleration for Q8 if I only have AVX and AVX2? I tested and found no difference.
https://justine.lol/matmul/ is a must read ^^) Thank you @jart, you got a new Patron
After this PR is merged I'll probably polish up the AVX implementation and submit it back up to llama.cpp. Since your llamafile is based on llama.cpp, any updates that we make here should eventually get pulled into your project. Or you can just pick it up now and use it as you see fit in llamafile, I don't mind.
Using it on many CPU setups and it speeds up everything in context processing!
Can these kernels make token generation faster?
I think probably not, because token generation is memory-bandwidth bound.
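A back-of-the-envelope example of why (assumed figures, not measurements from this PR): generating one token streams essentially every weight once, so a 7B f16 model (~14.5 GB of weights) on a desktop with roughly 50 GB/s of usable memory bandwidth tops out around 50 / 14.5 ≈ 3.5 tokens/s no matter how fast the matmul kernels are. Prompt processing reuses each weight across every token in the batch, which is why it is compute bound and benefits from better kernels.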
CPU speedups are always welcome, but I'm worried about the maintenance effort for the core ggml library increasing, so I'm still hesitating about how to proceed with this PR.
A similar discussion was already had in #5780 and there are likely to be other matrix-multiplication improvements proposed:
- ggml: aarch64: implement mmla kernels for q8_0_q8_0, q4_0_q8_0 and q4_1_q8_1 quantized gemm #4966
- add loongarch lsx and lasx optimize code #6454
- optimize for ppc64le using VSX intrinsics ggml#784
This change on one hand is well decoupled, which is good, but at the same time it introduces a new block-wise matrix-multiplication pattern that is different from the existing dot-based implementations. It's obviously significantly more performant since it utilizes the CPU cache much more efficiently, which has not been the case so far. It also seems that the implementation can be extended to more instruction sets and quantized types in the future, so the amount of code has the potential to grow significantly.
The code is also in C++, while we generally prefer to keep the core implementation in C and allow C++ only in the backend implementations when desired. I've been pretty stubborn with this C requirement and it's probably something to finally reconsider, but this PR is not the place to decide that.
I don't want to delay this much longer, as I've already given it quite some thought and haven't come to a good conclusion. I think the comments in #5780 apply to a good extent here (PTAL), so my suggestion is that we aim for this to become part of the future BLAS/matmul backend. The benefit of doing that is that the code becomes sort of an "extension" to ggml and can be developed more independently, without drawing a lot of attention from the core maintainers.
In the meantime, we can merge this change and, depending on how the development process goes (i.e. there is enough support from the community, bugs and issues are being resolved, functionality is reasonably extended, it remains well decoupled from the rest of the code), we can potentially consider making this part of the core ggml library. But until then it will remain sort of a "second-class citizen".
@jart If that makes sense, we would need to put the ggml.c change behind a define (e.g. GGML_USE_TINYBLAS or GGML_USE_LLAMAFILE or something like this), so that the sgemm code becomes optional (we generally avoid such special cases, but we can make an exception this time). In llama.cpp builds we can have this enabled by default as it seems it is always better than the alternatives. This way, llamafile and other downstream projects can directly benefit from the changes, and we'll have more time to figure out the right way to integrate this into ggml.
If you are OK with that, we can proceed to merge.
common/common.cpp (Outdated)

    if (cpu_count < 1)
        return get_num_physical_cores();
Suggested change:

    if (cpu_count < 1) {
        return get_num_physical_cores();
    }
sgemm.cpp (Outdated)

    case GGML_TYPE_Q8_0: {
        if (k % 32)
            return false;
        if (Btype != GGML_TYPE_Q8_0)
Suggested change (whitespace-only):

    if (Btype != GGML_TYPE_Q8_0)
Sounds good @ggerganov. Review comments addressed in 492b76d. PTAL
Also just want to draw attention to the loosening of the
Another thing worth mentioning, possibly for future iterations, is that:

template <int RM, int RN> void gemm(int m0, int m, int n0, int n) {
    int ytiles = (m - m0) / RM;
    int xtiles = (n - n0) / RN;
    int tiles = xtiles * ytiles;
    int duty = (tiles + nth - 1) / nth;
    int start = duty * ith;
    int end = start + duty;
    if (end > tiles)
        end = tiles;
    for (int job = start; job < end; ++job) {
        int ii = m0 + job / xtiles * RM;
        int jj = n0 + job % xtiles * RN;
        D Cv[RN][RM] = {0};
        for (int l = 0; l < k; l += KN)
            for (int j = 0; j < RN; ++j)
                for (int i = 0; i < RM; ++i)
                    Cv[j][i] = madd(load(A + lda * (ii + i) + l), //
                                    load(B + ldb * (jj + j) + l), //
                                    Cv[j][i]);
        TC Cd[RN][RM];
        for (int j = 0; j < RN; ++j)
            for (int i = 0; i < RM; ++i)
                Cd[j][i] = hsum(Cv[j][i]);
        for (int j = 0; j < RN; ++j)
            for (int i = 0; i < RM; ++i)
                C[ldc * (jj + j) + (ii + i)] = Cd[j][i];
    }
}

is able to generate the handwritten kernels in the
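(My reading of the snippet above, not the author's wording: RM and RN pick a register-tile shape, each job accumulates one RM x RN block of C in vector registers while streaming the K dimension in steps of the vector width KN, and the ith/nth arithmetic statically partitions the tiles across threads, so instantiating the template with different RM/RN values reproduces the specialized hand-written kernels.)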
The issue is that Clang takes 45 seconds to compile it.
Not a good idea - the build time should not increase noticeably after these changes.
I did some more tests on M2 Ultra. Generally, text-generation (batch size = 1) and prompt processing speed (batch size > 256) are the most important metrics to look at, but keeping an eye on the performance for low-sized batches is also important (e.g. parallel decoding, speculative decoding, etc.)
The following command will give you the speed for various batch sizes:
./llama-bench -m models/mistral-instruct-7b-v0.2/ggml-model-f16.gguf -ngl 0 -p 1,2,3,4,5,6,7,8,12,16,32,64,512 -n 0 -r 50 -t 16
These are the numbers with the llamafile SGEMM disabled:

LLAMA_NO_LLAMAFILE=1 LLAMA_NO_ACCELERATE=1 make -j llama-bench && ./llama-bench -m models/mistral-instruct-7b-v0.2/ggml-model-f16.gguf -ngl 0 -p 1,2,3,4,5,6,7,8,12,16,32,64,512 -n 0 -r 50 -t 16
model | size | params | backend | ngl | test | t/s |
---|---|---|---|---|---|---|
llama 7B F16 | 13.49 GiB | 7.24 B | Metal | 0 | pp 1 | 15.67 ± 0.25 |
llama 7B F16 | 13.49 GiB | 7.24 B | Metal | 0 | pp 2 | 26.14 ± 0.78 |
llama 7B F16 | 13.49 GiB | 7.24 B | Metal | 0 | pp 3 | 32.99 ± 0.29 |
llama 7B F16 | 13.49 GiB | 7.24 B | Metal | 0 | pp 4 | 37.72 ± 0.48 |
llama 7B F16 | 13.49 GiB | 7.24 B | Metal | 0 | pp 5 | 39.51 ± 0.61 |
llama 7B F16 | 13.49 GiB | 7.24 B | Metal | 0 | pp 6 | 43.78 ± 0.50 |
llama 7B F16 | 13.49 GiB | 7.24 B | Metal | 0 | pp 7 | 45.72 ± 1.26 |
llama 7B F16 | 13.49 GiB | 7.24 B | Metal | 0 | pp 8 | 47.13 ± 1.35 |
llama 7B F16 | 13.49 GiB | 7.24 B | Metal | 0 | pp 12 | 51.81 ± 0.53 |
llama 7B F16 | 13.49 GiB | 7.24 B | Metal | 0 | pp 16 | 53.54 ± 1.59 |
llama 7B F16 | 13.49 GiB | 7.24 B | Metal | 0 | pp 32 | 55.89 ± 0.46 |
llama 7B F16 | 13.49 GiB | 7.24 B | Metal | 0 | pp 64 | 57.53 ± 0.31 |
llama 7B F16 | 13.49 GiB | 7.24 B | Metal | 0 | pp 512 | 58.16 ± 0.22 |
build: 492b76d (2645)
This is the same bench with the llamafile SGEMM enabled:
LLAMA_NO_ACCELERATE=1 make -j llama-bench && ./llama-bench -m models/mistral-instruct-7b-v0.2/ggml-model-f16.gguf -ngl 0 -p 1,2,3,4,5,6,7,8,12,16,32,64,512 -n 0 -r 50 -t 16
model | size | params | backend | ngl | test | t/s |
---|---|---|---|---|---|---|
llama 7B F16 | 13.49 GiB | 7.24 B | Metal | 0 | pp 1 | 15.48 ± 0.73 |
llama 7B F16 | 13.49 GiB | 7.24 B | Metal | 0 | pp 2 | 25.94 ± 0.59 |
llama 7B F16 | 13.49 GiB | 7.24 B | Metal | 0 | pp 3 | 32.57 ± 1.29 |
llama 7B F16 | 13.49 GiB | 7.24 B | Metal | 0 | pp 4 | 37.63 ± 0.57 |
llama 7B F16 | 13.49 GiB | 7.24 B | Metal | 0 | pp 5 | 40.86 ± 1.22 |
llama 7B F16 | 13.49 GiB | 7.24 B | Metal | 0 | pp 6 | 43.59 ± 0.75 |
llama 7B F16 | 13.49 GiB | 7.24 B | Metal | 0 | pp 7 | 45.92 ± 0.40 |
llama 7B F16 | 13.49 GiB | 7.24 B | Metal | 0 | pp 8 | 33.38 ± 0.56 |
llama 7B F16 | 13.49 GiB | 7.24 B | Metal | 0 | pp 12 | 53.02 ± 0.58 |
llama 7B F16 | 13.49 GiB | 7.24 B | Metal | 0 | pp 16 | 69.40 ± 1.32 |
llama 7B F16 | 13.49 GiB | 7.24 B | Metal | 0 | pp 32 | 78.17 ± 0.57 |
llama 7B F16 | 13.49 GiB | 7.24 B | Metal | 0 | pp 64 | 101.11 ± 0.26 |
llama 7B F16 | 13.49 GiB | 7.24 B | Metal | 0 | pp 512 | 101.94 ± 0.70 |
build: 492b76d (2645)
For BS < 8 there is no difference since the SGEMM routines are not used, but at BS = 8 the SGEMM performs worse than mainline. Maybe there's room for improvement there.
It's also a good idea before merging to run some perplexity tests with F16 and Q4_0 7B LLaMA models to verify that the numbers are within expectation:
# use ./scripts/get-wikitext-2.sh to get wiki test data
# run ppl (can take a while)
./perplexity -f wikitext-2-raw/wiki.test.raw -m models/mistral-instruct-7b-v0.2/ggml-model-f16.gguf
scripts/sync-ggml-am.sh (Outdated)

    -e 's/src\/sgemm\.cpp/sgemm.cpp/g' \
    -e 's/src\/sgemm\.h/sgemm.h/g' \
No need to sync upstream for now
Done
ggml.c (Outdated)

    if (nb10 == ggml_type_size(src1->type)) {
        for (int64_t j = 0; j < ne13; j++)
            for (int64_t i = 0; i < ne12; i++)
The condition should be sufficient. Instead of i and j, use i12 and i13.
Done
@jart your work is wonderful, and I think there is room for more optimisation, but some of it may need more control over this operator. What if "TINYBLAS" were added as a backend (think like simd_backend...)? [Append]: I read this PR: #5780 (comment)
Force-pushed from 79705b2 to 2b83bf5
@ggerganov Since my change doesn't help much on M2, I changed it to be off by default on that platform.

#ifndef GGML_USE_LLAMAFILE
#ifdef __ARM_FEATURE_MATMUL_INT8
#define GGML_USE_LLAMAFILE 0
#else
#define GGML_USE_LLAMAFILE 1
#endif
#endif

PTAL
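(My gloss on the snippet, not jart's wording: when __ARM_FEATURE_MATMUL_INT8 is defined, i.e. the target advertises the ARMv8.6 int8 matrix-multiply instructions that ggml's existing mmla kernels from #4966 already use, the llamafile SGEMM defaults to off; on every other target it defaults to on, and a build can still override GGML_USE_LLAMAFILE explicitly.)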
> Since my change doesn't help much on M2, I changed it to be off by default on that platform.
Apart from the dip at BS=8, on my machine it does help: at BS=512 the GEMM in this PR is almost 2x faster. This is with LLAMA_NO_ACCELERATE=1 though, which disables Apple's CBLAS implementation from the Accelerate framework; for large BS that remains more efficient. Anyway, we can refine in the future.
Regarding GGML_USE_LLAMAFILE: as it is, when I upstream the changes to the ggml repo, the build will fail because there is no sgemm.cpp there. My idea was to define GGML_USE_LLAMAFILE=1 by default in the llama.cpp Makefile and CMake (unless LLAMA_NO_LLAMAFILE is set). I can of course add GGML_USE_LLAMAFILE=0 in the ggml repo, but it's better to have this as the default for now.
I vaguely recall that when I was working in an experimental branch, the 8x3 kernel (https://twitter.com/JustineTunney/status/1776440470152867930) would make GGML go faster than Accelerate. I've been reluctant to cause too much churn here in the interest of getting this PR in. Is there anything specific you need me to change on my end before this can be merged?
I don't think the RPi5 uses the Accelerate framework. AFAIK it's available on Apple devices, and the SGEMM that comes with it runs on some sort of specialized AMX coprocessor available in Apple Silicon, which brings extra performance to the table.
This change upstreams llamafile's CPU matrix multiplication kernels, which improve image and prompt evaluation speed. For starters, Q4_0 and Q8_0 weights should go ~40% faster on CPU. The biggest benefits are with data types like f16 / f32, which process prompts 2x faster, thus making them faster than quantized data types for prompt evals.
This change also introduces bona fide AVX512 support, since tinyBLAS is able to exploit the larger register file. For example, on my CPU llama.cpp llava-cli processes an image prompt at 305 tokens/second using the Q4_K and Q4_0 types, which has always been faster than using f16 LLaVA weights, which at HEAD go 188 tokens/second. With this change, f16 LLaVA performance leapfrogs to 464 tokens/second.
On Intel Core i9-14900K this change improves F16 prompt perf by 5x. For example, using llama.cpp at HEAD with Mistral 7B f16 to process a 215 token prompt will go 13 tok/sec. This change has fixes making it go 52 tok/sec. It's mostly thanks to my vectorized outer product kernels, but also because I added support for correctly counting the number of cores on Alder Lake, so the default thread count discounts Intel's new efficiency cores. Only Linux can count cores right now.
This work was sponsored by Mozilla, who has given permission to change the license of this code from Apache 2.0 to MIT. To read more about what's improved and how it works, see: https://justine.lol/matmul/