Improve cpu prompt eval speed #6414

jart · 2024-04-01T00:55:47Z

This change upstreams llamafile's cpu matrix multiplication kernels which improve image and prompt evaluation speed. For starters, Q4_0 and Q8_0 weights should go ~40% faster on CPU. The biggest benefits are with data types like f16 / f32, which process prompts 2x faster thus making them faster than quantized data types for prompt evals.

This change also introduces bona fide AVX512 support since tinyBLAS is able to exploit the larger register file. For example, on my CPU llama.cpp llava-cli processes an image prompt at 305 tokens/second, using the Q4_K and Q4_0 types, which has always been faster than if we used f16 LLaVA weights, which at HEAD go 188 tokens/second. With this change, f16 LLaVA performance leap frogs to 464 tokens/second.

On Intel Core i9-14900K this change improves F16 prompt perf by 5x. For example, using llama.cpp at HEAD with Mistral 7b f16 to process a 215 token prompt will go 13 tok/sec. This change has fixes making it go 52 tok/sec. It's mostly thanks to my vectorized outer product kernels but also because I added support for correctly counting the number of cores on Alderlake, so the default thread count discounts Intel's new efficiency cores. Only Linux right now can count cores.

This work was sponsored by Mozilla who's given permission to change the license of this code from Apache 2.0 to MIT. To read more about what's improved, and how it works, see: https://justine.lol/matmul/

phymbert · 2024-04-01T07:01:10Z

Please fix the CI builds

github-actions · 2024-04-01T07:49:28Z

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3: 517 iterations 🚀

Concurrent users: 8, duration: 10m
HTTP request : avg=9017.64ms p(90)=25806.42ms fails=0, finish reason: stop=517 truncated=0
Prompt processing (pp): avg=238.16tk/s p(90)=711.8tk/s total=203.92tk/s
Token generation (tg): avg=95.48tk/s p(90)=248.93tk/s total=128.87tk/s
ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=sgemm commit=8dbe58213391399b2e3b60b5b116b5dd6b864f96

Time series

More

---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3
 duration=10m 517 iterations"
    y-axis "llamacpp:prompt_tokens_seconds"
    x-axis "llamacpp:prompt_tokens_seconds" 1712197520 --> 1712198142
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 482.02, 482.02, 482.02, 482.02, 482.02, 480.93, 480.93, 480.93, 480.93, 480.93, 542.27, 542.27, 542.27, 542.27, 542.27, 599.19, 599.19, 599.19, 599.19, 599.19, 603.04, 603.04, 603.04, 603.04, 603.04, 623.99, 623.99, 623.99, 623.99, 623.99, 631.06, 631.06, 631.06, 631.06, 631.06, 631.12, 631.12, 631.12, 631.12, 631.12, 639.77, 639.77, 639.77, 639.77, 639.77, 656.83, 656.83, 656.83, 656.83, 656.83, 681.01, 681.01, 681.01, 681.01, 681.01, 677.07, 677.07, 677.07, 677.07, 677.07, 675.32, 675.32, 675.32, 675.32, 675.32, 640.6, 640.6, 640.6, 640.6, 640.6, 644.24, 644.24, 644.24, 644.24, 644.24, 643.62, 643.62, 643.62, 643.62, 643.62, 644.73, 644.73, 644.73, 644.73, 644.73, 643.23, 643.23, 643.23, 643.23, 643.23, 643.88, 643.88, 643.88, 643.88, 643.88, 643.15, 643.15, 643.15, 643.15, 643.15, 639.49, 639.49, 639.49, 639.49, 639.49, 638.15, 638.15, 638.15, 638.15, 638.15, 653.61, 653.61, 653.61, 653.61, 653.61, 654.36, 654.36, 654.36, 654.36, 654.36, 655.49, 655.49, 655.49, 655.49, 655.49, 664.2, 664.2, 664.2, 664.2, 664.2, 662.75, 662.75, 662.75, 662.75, 662.75, 660.8, 660.8, 660.8, 660.8, 660.8, 657.18, 657.18, 657.18, 657.18, 657.18, 656.13, 656.13, 656.13, 656.13, 656.13, 661.91, 661.91, 661.91, 661.91, 661.91, 660.87, 660.87, 660.87, 660.87, 660.87, 664.56, 664.56, 664.56, 664.56, 664.56, 676.0, 676.0, 676.0, 676.0, 676.0, 675.46, 675.46, 675.46, 675.46, 675.46, 680.42, 680.42, 680.42, 680.42, 680.42, 681.0, 681.0, 681.0, 681.0, 681.0, 680.06, 680.06, 680.06, 680.06, 680.06, 679.96, 679.96, 679.96, 679.96, 679.96, 684.29, 684.29, 684.29, 684.29, 684.29, 690.43, 690.43, 690.43, 690.43, 690.43, 680.82, 680.82, 680.82, 680.82, 680.82, 681.94, 681.94, 681.94, 681.94, 681.94, 681.94, 681.94, 681.94, 681.94, 681.94, 679.82, 679.82, 679.82, 679.82, 679.82, 678.17, 678.17, 678.17, 678.17, 678.17, 675.86, 675.86, 675.86, 675.86, 675.86, 679.6, 679.6, 679.6, 679.6, 679.6, 678.27, 678.27, 678.27, 678.27, 678.27, 678.13, 678.13, 678.13, 678.13, 678.13, 670.68, 670.68, 670.68, 670.68, 670.68, 670.78, 670.78, 670.78, 670.78, 670.78, 671.74, 671.74, 671.74, 671.74, 671.74, 672.1, 672.1, 672.1, 672.1, 672.1, 673.35, 673.35, 673.35, 673.35, 673.35, 672.79, 672.79, 672.79, 672.79, 672.79, 675.71, 675.71, 675.71, 675.71, 675.71, 671.67, 671.67, 671.67, 671.67, 671.67, 673.01, 673.01, 673.01, 673.01, 673.01, 672.08, 672.08]

More

---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3
 duration=10m 517 iterations"
    y-axis "llamacpp:predicted_tokens_seconds"
    x-axis "llamacpp:predicted_tokens_seconds" 1712197520 --> 1712198142
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 28.56, 28.56, 28.56, 28.56, 28.56, 16.27, 16.27, 16.27, 16.27, 16.27, 17.36, 17.36, 17.36, 17.36, 17.36, 17.94, 17.94, 17.94, 17.94, 17.94, 19.34, 19.34, 19.34, 19.34, 19.34, 20.0, 20.0, 20.0, 20.0, 20.0, 20.07, 20.07, 20.07, 20.07, 20.07, 20.14, 20.14, 20.14, 20.14, 20.14, 19.99, 19.99, 19.99, 19.99, 19.99, 19.95, 19.95, 19.95, 19.95, 19.95, 19.93, 19.93, 19.93, 19.93, 19.93, 19.71, 19.71, 19.71, 19.71, 19.71, 19.45, 19.45, 19.45, 19.45, 19.45, 18.37, 18.37, 18.37, 18.37, 18.37, 18.44, 18.44, 18.44, 18.44, 18.44, 18.6, 18.6, 18.6, 18.6, 18.6, 18.69, 18.69, 18.69, 18.69, 18.69, 18.57, 18.57, 18.57, 18.57, 18.57, 18.4, 18.4, 18.4, 18.4, 18.4, 18.32, 18.32, 18.32, 18.32, 18.32, 18.18, 18.18, 18.18, 18.18, 18.18, 18.28, 18.28, 18.28, 18.28, 18.28, 18.32, 18.32, 18.32, 18.32, 18.32, 18.25, 18.25, 18.25, 18.25, 18.25, 18.36, 18.36, 18.36, 18.36, 18.36, 18.41, 18.41, 18.41, 18.41, 18.41, 18.36, 18.36, 18.36, 18.36, 18.36, 18.3, 18.3, 18.3, 18.3, 18.3, 18.01, 18.01, 18.01, 18.01, 18.01, 18.01, 18.01, 18.01, 18.01, 18.01, 18.06, 18.06, 18.06, 18.06, 18.06, 18.14, 18.14, 18.14, 18.14, 18.14, 18.25, 18.25, 18.25, 18.25, 18.25, 18.28, 18.28, 18.28, 18.28, 18.28, 18.24, 18.24, 18.24, 18.24, 18.24, 18.26, 18.26, 18.26, 18.26, 18.26, 18.12, 18.12, 18.12, 18.12, 18.12, 18.1, 18.1, 18.1, 18.1, 18.1, 18.13, 18.13, 18.13, 18.13, 18.13, 18.17, 18.17, 18.17, 18.17, 18.17, 18.24, 18.24, 18.24, 18.24, 18.24, 18.2, 18.2, 18.2, 18.2, 18.2, 18.12, 18.12, 18.12, 18.12, 18.12, 18.12, 18.12, 18.12, 18.12, 18.12, 17.82, 17.82, 17.82, 17.82, 17.82, 17.78, 17.78, 17.78, 17.78, 17.78, 17.43, 17.43, 17.43, 17.43, 17.43, 17.11, 17.11, 17.11, 17.11, 17.11, 17.13, 17.13, 17.13, 17.13, 17.13, 17.21, 17.21, 17.21, 17.21, 17.21, 17.28, 17.28, 17.28, 17.28, 17.28, 17.33, 17.33, 17.33, 17.33, 17.33, 17.38, 17.38, 17.38, 17.38, 17.38, 17.41, 17.41, 17.41, 17.41, 17.41, 17.4, 17.4, 17.4, 17.4, 17.4, 17.39, 17.39, 17.39, 17.39, 17.39, 17.35, 17.35, 17.35, 17.35, 17.35, 17.33, 17.33, 17.33, 17.33, 17.33, 17.36, 17.36, 17.36, 17.36, 17.36, 17.44, 17.44]

Details

More

---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3
 duration=10m 517 iterations"
    y-axis "llamacpp:kv_cache_usage_ratio"
    x-axis "llamacpp:kv_cache_usage_ratio" 1712197520 --> 1712198142
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.33, 0.33, 0.33, 0.33, 0.33, 0.07, 0.07, 0.07, 0.07, 0.07, 0.09, 0.09, 0.09, 0.09, 0.09, 0.13, 0.13, 0.13, 0.13, 0.13, 0.12, 0.12, 0.12, 0.12, 0.12, 0.11, 0.11, 0.11, 0.11, 0.11, 0.18, 0.18, 0.18, 0.18, 0.18, 0.23, 0.23, 0.23, 0.23, 0.23, 0.13, 0.13, 0.13, 0.13, 0.13, 0.12, 0.12, 0.12, 0.12, 0.12, 0.2, 0.2, 0.2, 0.2, 0.2, 0.15, 0.15, 0.15, 0.15, 0.15, 0.3, 0.3, 0.3, 0.3, 0.3, 0.11, 0.11, 0.11, 0.11, 0.11, 0.18, 0.18, 0.18, 0.18, 0.18, 0.17, 0.17, 0.17, 0.17, 0.17, 0.24, 0.24, 0.24, 0.24, 0.24, 0.26, 0.26, 0.26, 0.26, 0.26, 0.25, 0.25, 0.25, 0.25, 0.25, 0.21, 0.21, 0.21, 0.21, 0.21, 0.13, 0.13, 0.13, 0.13, 0.13, 0.11, 0.11, 0.11, 0.11, 0.11, 0.27, 0.27, 0.27, 0.27, 0.27, 0.11, 0.11, 0.11, 0.11, 0.11, 0.12, 0.12, 0.12, 0.12, 0.12, 0.31, 0.31, 0.31, 0.31, 0.31, 0.34, 0.34, 0.34, 0.34, 0.34, 0.37, 0.37, 0.37, 0.37, 0.37, 0.2, 0.2, 0.2, 0.2, 0.2, 0.13, 0.13, 0.13, 0.13, 0.13, 0.14, 0.14, 0.14, 0.14, 0.14, 0.11, 0.11, 0.11, 0.11, 0.11, 0.18, 0.18, 0.18, 0.18, 0.18, 0.08, 0.08, 0.08, 0.08, 0.08, 0.15, 0.15, 0.15, 0.15, 0.15, 0.27, 0.27, 0.27, 0.27, 0.27, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.16, 0.16, 0.16, 0.16, 0.16, 0.12, 0.12, 0.12, 0.12, 0.12, 0.23, 0.23, 0.23, 0.23, 0.23, 0.35, 0.35, 0.35, 0.35, 0.35, 0.45, 0.45, 0.45, 0.45, 0.45, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.53, 0.53, 0.53, 0.53, 0.53, 0.36, 0.36, 0.36, 0.36, 0.36, 0.18, 0.18, 0.18, 0.18, 0.18, 0.16, 0.16, 0.16, 0.16, 0.16, 0.12, 0.12, 0.12, 0.12, 0.12, 0.15, 0.15, 0.15, 0.15, 0.15, 0.14, 0.14, 0.14, 0.14, 0.14, 0.2, 0.2, 0.2, 0.2, 0.2, 0.24, 0.24, 0.24, 0.24, 0.24, 0.22, 0.22, 0.22, 0.22, 0.22, 0.2, 0.2, 0.2, 0.2, 0.2, 0.24, 0.24, 0.24, 0.24, 0.24, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.14, 0.14]

More

---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3
 duration=10m 517 iterations"
    y-axis "llamacpp:requests_processing"
    x-axis "llamacpp:requests_processing" 1712197520 --> 1712198142
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 6.0, 6.0, 6.0, 6.0, 6.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 5.0, 5.0, 5.0, 5.0, 5.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 6.0, 6.0, 6.0, 6.0, 6.0, 8.0, 8.0, 8.0, 8.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0]

JohannesGaessler · 2024-04-01T08:21:39Z

Some very quick tests on my Ryzen 5950X (power limited to 95 W):

Model	Threads	Test	t/s master	t/s	Speedup
llama 7B Q4_0	16	pp 512	24.60	32.34	1.31
llama 7B Q4_0	16	tg 128	9.75	9.86	1.01
llama 7B F16	16	pp 512	27.70	42.74	1.54
llama 7B F16	16	tg 128	3.20	3.19	1.00

A very respectable speedup!

Since you did not mention it in the OP, this PR does not touch the handling of NUMA nodes, correct?

kalomaze · 2024-04-01T08:41:14Z

Is this not yet set up to support the CPU code used in partial GPU offloading? Will those require custom kernels?

JohannesGaessler · 2024-04-01T08:47:33Z

Is this not yet set up to support the CPU code used in partial GPU offloading? Will those require custom kernels?

This PR will not speed up CPU+GPU hybrid inference in any meaningful capacity. For large batches you are compute bound and all of the evaluations are done on the GPU. For small batches you are I/O bound and better matrix multiplication algorithms make virtually no difference.

kalomaze · 2024-04-01T09:52:40Z

For large batches you are compute bound and all of the evaluations are done on the GPU.

Does this mean it moves layers onto the GPU for large batches instead of processing all GPU layers for the current batch and then doing the remaining layers on CPU? I'm sort of lost, this works against my current understanding (moving from CPU to GPU during inference should be slower)?

JohannesGaessler · 2024-04-01T10:46:02Z

CPU layers have their data in RAM. GPU layers have their data in VRAM. GPU layers are always evaluated on the GPU.

The most recent update is this PR: #6083 . For large batch sizes (prompt processing) all data of a CPU layer is moved to the GPU and the calculations are done there in order to make use of the higher GPU compute. For small batch sizes (token generation) CPU layers are evaluated on the CPU. This PR improves the compute efficiency of CPU matrix multiplication. So it only helps in those scenarios where it would also be worthwhile to temporarily move data to VRAM. The improvements in this PR are therefore mutually exclusive with CPU+GPU hybrid inference.

zougloub · 2024-04-01T16:07:46Z

@jart this is pretty awesome ; I would add that since a good portion of the contributed code is very generic and could benefit to many other downstream projects, it would be even more awesome if that code could be in its own repo ; then a subset could be linked or vendored in here.

jart · 2024-04-01T19:36:02Z

@phymbert Tests are green. Please take a look.

phymbert · 2024-04-01T19:41:30Z

@phymbert Tests are green. Please take a look.

Thank you very much for the contribution. For the core library and ggml changes @slaren and @ggerganov will revert to you.

jart · 2024-04-01T19:52:56Z

@zougloub Thank you for the encouragement. You can copy sgemm.cpp into your codebase as its own library if you provide an implementation for GGML_FP16_TO_FP32(). It would be challenging to create a bona fide library for this, because GEMM has more depth the more stakeholders you have. This code is written to focus only on what's good for llama.cpp and nothing else. The parallel implementation in the llamafile codebase does things a little differently, based on what's best there.

ggerganov · 2024-04-01T20:05:26Z

@jart Apologies for the slow response - will review the PRs in the following days. Thanks

jart · 2024-04-01T20:12:29Z

Thanks @ggerganov I'm in no rush.

netrunnereve · 2024-04-02T01:21:25Z

This PR did absolutely nothing for me on Q4_0 and Q8_0, then I realised that it only supported AVX2 and AVX512 for those quants. It does support regular AVX though for F16 and F32.

On my 4c/8t Xeon v2 I get a nice 2x speedup in F16. Just like vanilla llama.cpp you get the best CPU performance if you use all hyperthreads during prompt processing and switch to one thread per core for inference.

model	size	params	backend	threads	test	t/s
llama 1B F16	2.05 GiB	1.10 B	CPU	8	pp 512	33.36 ± 0.13
llama 1B F16	2.05 GiB	1.10 B	CPU	4	pp 512	32.19 ± 0.02
llama 1B F16 (PR)	2.05 GiB	1.10 B	CPU	8	pp 512	60.47 ± 0.06
llama 1B F16 (PR)	2.05 GiB	1.10 B	CPU	4	pp 512	52.88 ± 0.12

JohannesGaessler · 2024-04-02T10:21:29Z

@netrunnereve in case you're not aware, you can run ./llama-bench -o sql | sqlite3 llama-bench.sqlite both on master and a PR and then scripts/compare-llama-bench.py to generate a table with a performance comparison.

JeremyGe07 · 2024-04-02T10:28:09Z

Does this PR benefit ARM CPU?

tstanisl · 2024-04-02T11:46:30Z

ggml.c

+        return;
+    }
+UseGgmlGemm1:
+    (void)0;


That's to avoid the compiler complaining if the label comes before a variable declaration.

Yes. But using just ; is a simple way to achieve the same goal without introducing dummy expressions.. Just do:

UseGgmlGemm1: ;

It works perfectly fine since C99 standard. See https://godbolt.org/z/6sbKhnhW9 .

BTW. This issue with labels is fixed in C23 standard.

Thank you @tstanisl, you taught me something I didn't know. Fixed!

tstanisl · 2024-04-02T11:46:40Z

ggml.c

+        return;
+    }
+UseGgmlGemm2:
+    (void)0;


sorasoras · 2024-04-02T12:31:20Z

Does this PR benefit ARM CPU?

I think so.
it has to be ARMV8.2+ I guess.
https://justine.lol/matmul/

Jipok · 2024-04-02T20:03:28Z

Should I have acceleration for Q8 if I only have AVX and AVX2? I tested and found no differences.
Do I need to build with some kind of blas?

phymbert · 2024-04-02T21:19:42Z

https://justine.lol/matmul/ is a must read ^^) Thank you @jart, you got a new Patron

netrunnereve · 2024-04-05T04:05:08Z

Nice work @netrunnereve. Would you be interested in contributing your change to llamafile too?

After this PR is merged I'll probably polish up the AVX implementation and submit it back up to llama.cpp. Since your llamafile is based off llama.cpp any updates that we make here should eventually get pulled into your project. Or you can just pick it up now and use it as you see fit in llamafile, I don't mind.

moshemalawach · 2024-04-07T22:23:16Z

Using it on many CPU setups and it speeds up everything on context processing!

ZelinMa557 · 2024-04-08T03:53:15Z

Can these kernels make the token generation faster?

lin72h · 2024-04-08T06:36:06Z

Can these kernels make the token generation faster?

I think it probably not, because token generation is memory bandwidth bound

ggerganov

CPU speedups are always welcome, but I’m worried about the maintenance efforts for the core ggml library increasing, so I’m still hesitating how to proceed with this PR.

Similar discussion was already had in #5780 and there are likely to be other matrix-multiplication improvements proposed:

This change on one hand is well decoupled which is good, but at the same time introduces a new block-wise matrix-multiplication pattern that is different from the existing dot-based implementations. It’s obviously significantly more performant since it utilizes the CPU cache much more efficiently, which has not been the case so far. It also seems that the implementation can be extended to more instruction sets and quantum types in the future, so the amount of code has the potential to grow significantly.

The code is also in C++, while we generally prefer to keep the core implementation in C and allow C++ only in the backend implementations when desired. I’ve been pretty stubborn with this C requirement and it’s probably something to finally reconsider, but it’s not the time to decide in this PR.

I don’t want to delay this for much longer as I’ve already given this quite some thought and haven’t come to a good conclusion. I think the comments in #5780 apply to a good extend here (PTAL), so my suggestion is that we aim for this to become part of the future BLAS/matmul backend. The benefit of doing that is that the code becomes sort of an "extension" to ggml and can be developed more independently, without drawing a lot of attention from the core maintainers.

In the meantime, we can merge this change and depending on how the development process goes (i.e. there is enough support from the community, bugs and issues are being resolved, functionality is reasonably extended, remains well decoupled from the rest of the code) we can potentially consider to make this part of the core ggml library. But until then it will remain sort of a "second-class citizen".

@jart If that makes sense, we would need to put the ggml.c change behind a define (e.g. GGML_USE_TINYBLAS or GGML_USE_LLAMAFILE or something like this), so that the sgemm code becomes optional (we generally avoid such special cases, but we can make an exception this time). In llama.cpp builds we can have this enabled by default as it seems it is always better than the alternatives. This way, llamafile and other downstream projects can directly benefit from the changes, and we'll have more time to figure out what is the right way to integrate this into ggml.

If you are OK with that, we can proceed to merge

common/common.cpp

ggerganov · 2024-04-10T10:24:37Z

common/common.cpp

+    if (cpu_count < 1)
+        return get_num_physical_cores();


Suggested change

if (cpu_count < 1)

return get_num_physical_cores();

if (cpu_count < 1) {

return get_num_physical_cores();

}

common/common.cpp

ggerganov · 2024-04-10T11:00:07Z

sgemm.cpp

+    case GGML_TYPE_Q8_0: {
+        if (k % 32)
+            return false;
+       if (Btype != GGML_TYPE_Q8_0)


Suggested change

if (Btype != GGML_TYPE_Q8_0)

if (Btype != GGML_TYPE_Q8_0)

sgemm.cpp

jart · 2024-04-11T07:28:18Z

Sounds good @ggerganov. Review comments addressed in 492b76d PTAL

jart · 2024-04-11T07:31:15Z

Also just want to draw attention to the loosening of the src1_cont restriction. Could you confirm that's correct?

jart · 2024-04-11T07:40:09Z

Another thing worth mentioning, possibly for future iterations is that:

    template <int RM, int RN> void gemm(int m0, int m, int n0, int n) {
        int ytiles = (m - m0) / RM;
        int xtiles = (n - n0) / RN;
        int tiles = xtiles * ytiles;
        int duty = (tiles + nth - 1) / nth;
        int start = duty * ith;
        int end = start + duty;
        if (end > tiles)
            end = tiles;
        for (int job = start; job < end; ++job) {
            int ii = m0 + job / xtiles * RM;
            int jj = n0 + job % xtiles * RN;
            D Cv[RN][RM] = {0};
            for (int l = 0; l < k; l += KN)
                for (int j = 0; j < RN; ++j)
                    for (int i = 0; i < RM; ++i)
                        Cv[j][i] = madd(load(A + lda * (ii + i) + l), //
                                        load(B + ldb * (jj + j) + l), //
                                        Cv[j][i]);
            TC Cd[RN][RM];
            for (int j = 0; j < RN; ++j)
                for (int i = 0; i < RM; ++i)
                    Cd[j][i] = hsum(Cv[j][i]);
            for (int j = 0; j < RN; ++j)
                for (int i = 0; i < RM; ++i)
                    C[ldc * (jj + j) + (ii + i)] = Cd[j][i];
        }
    }

Is able to generate the handwritten kernels in the tinyBLAS class. This makes it possible to generate an mnpack() method that optimally handles all edge cases for weirdly shaped n and m values. See https://gist.github.com/jart/640231a627dfbd02fb03e23e8b01e592#file-matmul-cpp-L295-L609 for an example. The issue is that Clang takes 45 seconds to compile it. Would you want me to simplify the code so it's more abstract but potentially slower to compile?

ggerganov

The issue is that Clang takes 45 seconds to compile it.

Not a good idea - the build time should not increase noticeably after these changes.

I did some more tests on M2 Ultra. Generally, text-generation (batch size = 1) and prompt processing speed (batch size > 256) are the most important metrics to look at, but keeping an eye on the performance for low-sized batches is also important (e.g. parallel decoding, speculative decoding, etc.)

The following command will give you the speed for various batch sizes:

./llama-bench -m models/mistral-instruct-7b-v0.2/ggml-model-f16.gguf -ngl 0 -p 1,2,3,4,5,6,7,8,12,16,32,64,512 -n 0 -r 50 -t 16

These are the numbers with the llamafile SGEMM disabled:

LLAMA_NO_LLAMAFILE=1 LLAMA_NO_ACCELERATE=1 make -j llama-bench && ./llama-bench -m models/mistral-instruct-7b-v0.2/ggml-model-f16.gguf -ngl 0 -p 1,2,3,4,5,6,7,8,12,16,32,64,512 -n 0 -r 50 -t 16

model	size	params	backend	test	t/s
llama 7B F16	13.49 GiB	7.24 B	Metal	pp 1	15.67 ± 0.25
llama 7B F16	13.49 GiB	7.24 B	Metal	pp 2	26.14 ± 0.78
llama 7B F16	13.49 GiB	7.24 B	Metal	pp 3	32.99 ± 0.29
llama 7B F16	13.49 GiB	7.24 B	Metal	pp 4	37.72 ± 0.48
llama 7B F16	13.49 GiB	7.24 B	Metal	pp 5	39.51 ± 0.61
llama 7B F16	13.49 GiB	7.24 B	Metal	pp 6	43.78 ± 0.50
llama 7B F16	13.49 GiB	7.24 B	Metal	pp 7	45.72 ± 1.26
llama 7B F16	13.49 GiB	7.24 B	Metal	pp 8	47.13 ± 1.35
llama 7B F16	13.49 GiB	7.24 B	Metal	pp 12	51.81 ± 0.53
llama 7B F16	13.49 GiB	7.24 B	Metal	pp 16	53.54 ± 1.59
llama 7B F16	13.49 GiB	7.24 B	Metal	pp 32	55.89 ± 0.46
llama 7B F16	13.49 GiB	7.24 B	Metal	pp 64	57.53 ± 0.31
llama 7B F16	13.49 GiB	7.24 B	Metal	pp 512	58.16 ± 0.22

build: 492b76d (2645)

This is the same bench with llamafile SGEMM enabled:

LLAMA_NO_ACCELERATE=1 make -j llama-bench && ./llama-bench -m models/mistral-instruct-7b-v0.2/ggml-model-f16.gguf -ngl 0 -p 1,2,3,4,5,6,7,8,12,16,32,64,512 -n 0 -r 50 -t 16

model	size	params	backend	test	t/s
llama 7B F16	13.49 GiB	7.24 B	Metal	pp 1	15.48 ± 0.73
llama 7B F16	13.49 GiB	7.24 B	Metal	pp 2	25.94 ± 0.59
llama 7B F16	13.49 GiB	7.24 B	Metal	pp 3	32.57 ± 1.29
llama 7B F16	13.49 GiB	7.24 B	Metal	pp 4	37.63 ± 0.57
llama 7B F16	13.49 GiB	7.24 B	Metal	pp 5	40.86 ± 1.22
llama 7B F16	13.49 GiB	7.24 B	Metal	pp 6	43.59 ± 0.75
llama 7B F16	13.49 GiB	7.24 B	Metal	pp 7	45.92 ± 0.40
llama 7B F16	13.49 GiB	7.24 B	Metal	pp 8	33.38 ± 0.56
llama 7B F16	13.49 GiB	7.24 B	Metal	pp 12	53.02 ± 0.58
llama 7B F16	13.49 GiB	7.24 B	Metal	pp 16	69.40 ± 1.32
llama 7B F16	13.49 GiB	7.24 B	Metal	pp 32	78.17 ± 0.57
llama 7B F16	13.49 GiB	7.24 B	Metal	pp 64	101.11 ± 0.26
llama 7B F16	13.49 GiB	7.24 B	Metal	pp 512	101.94 ± 0.70

build: 492b76d (2645)

For BS < 8 there is no difference since the SGEMM routines are not used, but at BS = 8 the SGEMM performs worse to mainline. Maybe there's room for improvement there.

It's also a good idea before merging to run some perplexity tests with F16 and Q4_0 7B LLaMA models to verify that the numbers are within expectation:

# use ./scripts/get-wikitext-2.sh to get wiki test data

# run ppl (can take a while)
./perplexity -f wikitext-2-raw/wiki.test.raw -m models/mistral-instruct-7b-v0.2/ggml-model-f16.gguf

ggerganov · 2024-04-11T11:54:32Z

scripts/sync-ggml-am.sh

+        -e 's/src\/sgemm\.cpp/sgemm.cpp/g' \
+        -e 's/src\/sgemm\.h/sgemm.h/g' \


No need to sync upstream for now

ggerganov · 2024-04-11T13:06:33Z

ggml.c

+    if (nb10 == ggml_type_size(src1->type)) {
+        for (int64_t j = 0; j < ne13; j++)
+            for (int64_t i = 0; i < ne12; i++)


The condition should be sufficient.

Instead of i and j use i12 and i13

Djip007 · 2024-04-11T20:37:11Z

@jart your work is wonderfull. and I think there is room for more optimisation. But some may need more control on this operator.
@ggerganov is worried with the size / maintenance of ggml core.

But what if "TINYBLAS" is added as a backend (think like simd_backend...)
I've spent the last few days trying to figure out the design of llama.cpp and backend. Look that if "TINYBLAS" is a backend you can have even more control over what you can implement (choice of block size, storage architecture, etc.)

[Append]: I read this PR: #5780 (comment)
It seems that there are already discussions about how to handle rearranged tensor and the use of new backends ..

This change upstreams llamafile's cpu matrix multiplication kernels which improve image and prompt evaluation speed. For starters, Q4_0 and Q8_0 weights should go ~40% faster on CPU. The biggest benefits are with data types like f16 / f32, which process prompts 2x faster thus making them faster than quantized data types for prompt evals. This change also introduces bona fide AVX512 support since tinyBLAS is able to exploit the larger register file. For example, on my CPU llama.cpp llava-cli processes an image prompt at 305 tokens/second, using the Q4_K and Q4_0 types, which has always been faster than if we used f16 LLaVA weights, which at HEAD go 188 tokens/second. With this change, f16 LLaVA performance leap frogs to 464 tokens/second. On Intel Core i9-14900K this change improves F16 prompt perf by 5x. For example, using llama.cpp at HEAD with Mistral 7b f16 to process a 215 token prompt will go 13 tok/sec. This change has fixes making it go 52 tok/sec. It's mostly thanks to my vectorized outer product kernels but also because I added support for correctly counting the number of cores on Alderlake, so the default thread count discounts Intel's new efficiency cores. Only Linux right now can count cores. This work was sponsored by Mozilla who's given permission to change the license of this code from Apache 2.0 to MIT. To read more about what's improved, and how it works, see: https://justine.lol/matmul/

jart · 2024-04-16T02:48:00Z

@ggerganov Since my change doesn't help much on M2, I changed it to be off by default on that platform.

#ifndef GGML_USE_LLAMAFILE
#ifdef __ARM_FEATURE_MATMUL_INT8
#define GGML_USE_LLAMAFILE 0
#else
#define GGML_USE_LLAMAFILE 1
#endif
#endif

PTAL

github-actions · 2024-04-16T03:26:57Z

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 462 iterations 🚀

Expand details for performance related PR only

Concurrent users: 8, duration: 10m
HTTP request : avg=10248.29ms p(95)=27753.38ms fails=, finish reason: stop=409 truncated=53
Prompt processing (pp): avg=113.07tk/s p(95)=512.58tk/s
Token generation (tg): avg=23.87tk/s p(95)=36.95tk/s
ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=sgemm commit=183c4bb3656f2842a8871df25e6fb8e1abe18f3f

More

---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3
 duration=10m 462 iterations"
    y-axis "llamacpp:prompt_tokens_seconds"
    x-axis "llamacpp:prompt_tokens_seconds" 1713237376 --> 1713238010
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 413.54, 413.54, 413.54, 413.54, 413.54, 473.28, 473.28, 473.28, 473.28, 473.28, 434.54, 434.54, 434.54, 434.54, 434.54, 458.69, 458.69, 458.69, 458.69, 458.69, 487.66, 487.66, 487.66, 487.66, 487.66, 538.06, 538.06, 538.06, 538.06, 538.06, 541.98, 541.98, 541.98, 541.98, 541.98, 542.13, 542.13, 542.13, 542.13, 542.13, 568.74, 568.74, 568.74, 568.74, 568.74, 576.84, 576.84, 576.84, 576.84, 576.84, 578.57, 578.57, 578.57, 578.57, 578.57, 586.56, 586.56, 586.56, 586.56, 586.56, 608.9, 608.9, 608.9, 608.9, 608.9, 616.35, 616.35, 616.35, 616.35, 616.35, 624.34, 624.34, 624.34, 624.34, 624.34, 609.54, 609.54, 609.54, 609.54, 609.54, 606.37, 606.37, 606.37, 606.37, 606.37, 606.82, 606.82, 606.82, 606.82, 606.82, 610.89, 610.89, 610.89, 610.89, 610.89, 610.73, 610.73, 610.73, 610.73, 610.73, 624.11, 624.11, 624.11, 624.11, 624.11, 627.85, 627.85, 627.85, 627.85, 627.85, 620.81, 620.81, 620.81, 620.81, 620.81, 620.59, 620.59, 620.59, 620.59, 620.59, 626.44, 626.44, 626.44, 626.44, 626.44, 627.28, 627.28, 627.28, 627.28, 627.28, 631.57, 631.57, 631.57, 631.57, 631.57, 645.0, 645.0, 645.0, 645.0, 645.0, 642.83, 642.83, 642.83, 642.83, 642.83, 646.15, 646.15, 646.15, 646.15, 646.15, 647.61, 647.61, 647.61, 647.61, 647.61, 652.23, 652.23, 652.23, 652.23, 652.23, 652.74, 652.74, 652.74, 652.74, 652.74, 652.66, 652.66, 652.66, 652.66, 652.66, 653.88, 653.88, 653.88, 653.88, 653.88, 659.16, 659.16, 659.16, 659.16, 659.16, 659.52, 659.52, 659.52, 659.52, 659.52, 659.72, 659.72, 659.72, 659.72, 659.72, 662.16, 662.16, 662.16, 662.16, 662.16, 662.14, 662.14, 662.14, 662.14, 662.14, 670.43, 670.43, 670.43, 670.43, 670.43, 675.83, 675.83, 675.83, 675.83, 675.83, 681.14, 681.14, 681.14, 681.14, 681.14, 684.43, 684.43, 684.43, 684.43, 684.43, 683.68, 683.68, 683.68, 683.68, 683.68, 683.12, 683.12, 683.12, 683.12, 683.12, 681.99, 681.99, 681.99, 681.99, 681.99, 685.48, 685.48, 685.48, 685.48, 685.48, 687.78, 687.78, 687.78, 687.78, 687.78, 689.88, 689.88, 689.88, 689.88, 689.88, 685.92, 685.92, 685.92, 685.92, 685.92, 671.1, 671.1, 671.1, 671.1, 671.1, 670.62, 670.62, 670.62, 670.62, 670.62, 669.47, 669.47, 669.47, 669.47, 669.47, 669.37, 669.37, 669.37, 669.37, 669.37, 663.64, 663.64, 663.64, 663.64, 663.64, 666.21, 666.21, 666.21, 666.21, 666.21, 666.59, 666.59, 666.59, 666.59, 666.59, 670.74, 670.74, 670.74, 670.74, 670.74, 671.14, 671.14, 671.14, 671.14, 671.14, 672.09, 672.09, 672.09, 672.09, 672.09, 672.09, 672.09, 672.09]

More

---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3
 duration=10m 462 iterations"
    y-axis "llamacpp:predicted_tokens_seconds"
    x-axis "llamacpp:predicted_tokens_seconds" 1713237376 --> 1713238010
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 33.07, 33.07, 33.07, 33.07, 33.07, 33.11, 33.11, 33.11, 33.11, 33.11, 22.34, 22.34, 22.34, 22.34, 22.34, 22.6, 22.6, 22.6, 22.6, 22.6, 23.07, 23.07, 23.07, 23.07, 23.07, 23.18, 23.18, 23.18, 23.18, 23.18, 23.3, 23.3, 23.3, 23.3, 23.3, 24.38, 24.38, 24.38, 24.38, 24.38, 25.48, 25.48, 25.48, 25.48, 25.48, 25.49, 25.49, 25.49, 25.49, 25.49, 25.42, 25.42, 25.42, 25.42, 25.42, 24.67, 24.67, 24.67, 24.67, 24.67, 24.58, 24.58, 24.58, 24.58, 24.58, 24.48, 24.48, 24.48, 24.48, 24.48, 23.98, 23.98, 23.98, 23.98, 23.98, 23.81, 23.81, 23.81, 23.81, 23.81, 23.32, 23.32, 23.32, 23.32, 23.32, 23.0, 23.0, 23.0, 23.0, 23.0, 22.85, 22.85, 22.85, 22.85, 22.85, 23.13, 23.13, 23.13, 23.13, 23.13, 23.16, 23.16, 23.16, 23.16, 23.16, 22.72, 22.72, 22.72, 22.72, 22.72, 22.49, 22.49, 22.49, 22.49, 22.49, 22.22, 22.22, 22.22, 22.22, 22.22, 22.14, 22.14, 22.14, 22.14, 22.14, 22.02, 22.02, 22.02, 22.02, 22.02, 22.07, 22.07, 22.07, 22.07, 22.07, 22.24, 22.24, 22.24, 22.24, 22.24, 22.16, 22.16, 22.16, 22.16, 22.16, 22.32, 22.32, 22.32, 22.32, 22.32, 22.38, 22.38, 22.38, 22.38, 22.38, 22.37, 22.37, 22.37, 22.37, 22.37, 22.17, 22.17, 22.17, 22.17, 22.17, 22.04, 22.04, 22.04, 22.04, 22.04, 22.11, 22.11, 22.11, 22.11, 22.11, 22.4, 22.4, 22.4, 22.4, 22.4, 22.51, 22.51, 22.51, 22.51, 22.51, 22.64, 22.64, 22.64, 22.64, 22.64, 22.72, 22.72, 22.72, 22.72, 22.72, 22.74, 22.74, 22.74, 22.74, 22.74, 22.63, 22.63, 22.63, 22.63, 22.63, 22.51, 22.51, 22.51, 22.51, 22.51, 22.5, 22.5, 22.5, 22.5, 22.5, 22.3, 22.3, 22.3, 22.3, 22.3, 22.27, 22.27, 22.27, 22.27, 22.27, 22.26, 22.26, 22.26, 22.26, 22.26, 22.25, 22.25, 22.25, 22.25, 22.25, 22.34, 22.34, 22.34, 22.34, 22.34, 22.54, 22.54, 22.54, 22.54, 22.54, 22.61, 22.61, 22.61, 22.61, 22.61, 22.42, 22.42, 22.42, 22.42, 22.42, 22.15, 22.15, 22.15, 22.15, 22.15, 22.08, 22.08, 22.08, 22.08, 22.08, 21.89, 21.89, 21.89, 21.89, 21.89, 21.35, 21.35, 21.35, 21.35, 21.35, 21.33, 21.33, 21.33, 21.33, 21.33, 21.29, 21.29, 21.29, 21.29, 21.29, 21.36, 21.36, 21.36, 21.36, 21.36, 21.45, 21.45, 21.45, 21.45, 21.45, 21.47, 21.47, 21.47, 21.47, 21.47, 21.61, 21.61, 21.61, 21.61, 21.61, 21.66, 21.66, 21.66]

Details

More

---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3
 duration=10m 462 iterations"
    y-axis "llamacpp:kv_cache_usage_ratio"
    x-axis "llamacpp:kv_cache_usage_ratio" 1713237376 --> 1713238010
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1, 0.1, 0.1, 0.1, 0.1, 0.4, 0.4, 0.4, 0.4, 0.4, 0.26, 0.26, 0.26, 0.26, 0.26, 0.21, 0.21, 0.21, 0.21, 0.21, 0.15, 0.15, 0.15, 0.15, 0.15, 0.16, 0.16, 0.16, 0.16, 0.16, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.12, 0.12, 0.12, 0.12, 0.12, 0.17, 0.17, 0.17, 0.17, 0.17, 0.23, 0.23, 0.23, 0.23, 0.23, 0.17, 0.17, 0.17, 0.17, 0.17, 0.2, 0.2, 0.2, 0.2, 0.2, 0.17, 0.17, 0.17, 0.17, 0.17, 0.22, 0.22, 0.22, 0.22, 0.22, 0.29, 0.29, 0.29, 0.29, 0.29, 0.25, 0.25, 0.25, 0.25, 0.25, 0.15, 0.15, 0.15, 0.15, 0.15, 0.14, 0.14, 0.14, 0.14, 0.14, 0.19, 0.19, 0.19, 0.19, 0.19, 0.25, 0.25, 0.25, 0.25, 0.25, 0.29, 0.29, 0.29, 0.29, 0.29, 0.28, 0.28, 0.28, 0.28, 0.28, 0.2, 0.2, 0.2, 0.2, 0.2, 0.16, 0.16, 0.16, 0.16, 0.16, 0.16, 0.16, 0.16, 0.16, 0.16, 0.15, 0.15, 0.15, 0.15, 0.15, 0.32, 0.32, 0.32, 0.32, 0.32, 0.11, 0.11, 0.11, 0.11, 0.11, 0.13, 0.13, 0.13, 0.13, 0.13, 0.15, 0.15, 0.15, 0.15, 0.15, 0.26, 0.26, 0.26, 0.26, 0.26, 0.21, 0.21, 0.21, 0.21, 0.21, 0.09, 0.09, 0.09, 0.09, 0.09, 0.15, 0.15, 0.15, 0.15, 0.15, 0.12, 0.12, 0.12, 0.12, 0.12, 0.1, 0.1, 0.1, 0.1, 0.1, 0.16, 0.16, 0.16, 0.16, 0.16, 0.15, 0.15, 0.15, 0.15, 0.15, 0.14, 0.14, 0.14, 0.14, 0.14, 0.11, 0.11, 0.11, 0.11, 0.11, 0.18, 0.18, 0.18, 0.18, 0.18, 0.23, 0.23, 0.23, 0.23, 0.23, 0.17, 0.17, 0.17, 0.17, 0.17, 0.19, 0.19, 0.19, 0.19, 0.19, 0.18, 0.18, 0.18, 0.18, 0.18, 0.09, 0.09, 0.09, 0.09, 0.09, 0.12, 0.12, 0.12, 0.12, 0.12, 0.11, 0.11, 0.11, 0.11, 0.11, 0.31, 0.31, 0.31, 0.31, 0.31, 0.39, 0.39, 0.39, 0.39, 0.39, 0.41, 0.41, 0.41, 0.41, 0.41, 0.37, 0.37, 0.37, 0.37, 0.37, 0.37, 0.37, 0.37, 0.37, 0.37, 0.35, 0.35, 0.35, 0.35, 0.35, 0.1, 0.1, 0.1, 0.1, 0.1, 0.16, 0.16, 0.16, 0.16, 0.16, 0.11, 0.11, 0.11, 0.11, 0.11, 0.13, 0.13, 0.13, 0.13, 0.13, 0.1, 0.1, 0.1, 0.1, 0.1, 0.18, 0.18, 0.18, 0.18, 0.18, 0.24, 0.24, 0.24]

More

---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3
 duration=10m 462 iterations"
    y-axis "llamacpp:requests_processing"
    x-axis "llamacpp:requests_processing" 1713237376 --> 1713238010
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 5.0, 5.0, 5.0, 5.0, 5.0, 3.0, 3.0, 3.0, 3.0, 3.0, 4.0, 4.0, 4.0, 4.0, 4.0, 5.0, 5.0, 5.0, 5.0, 5.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 3.0, 3.0, 3.0, 3.0, 3.0, 8.0, 8.0, 8.0, 8.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 4.0, 4.0, 4.0, 4.0, 4.0, 7.0, 7.0, 7.0, 7.0, 7.0, 4.0, 4.0, 4.0, 4.0, 4.0, 3.0, 3.0, 3.0, 3.0, 3.0, 7.0, 7.0, 7.0, 7.0, 7.0, 5.0, 5.0, 5.0, 5.0, 5.0, 4.0, 4.0, 4.0, 4.0, 4.0, 6.0, 6.0, 6.0, 6.0, 6.0, 5.0, 5.0, 5.0, 5.0, 5.0, 8.0, 8.0, 8.0, 8.0, 8.0, 6.0, 6.0, 6.0, 6.0, 6.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 3.0, 3.0, 3.0, 3.0, 3.0, 6.0, 6.0, 6.0, 6.0, 6.0, 5.0, 5.0, 5.0, 5.0, 5.0, 4.0, 4.0, 4.0, 4.0, 4.0, 6.0, 6.0, 6.0, 6.0, 6.0, 8.0, 8.0, 8.0, 8.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 5.0, 5.0, 5.0, 5.0, 5.0, 4.0, 4.0, 4.0, 4.0, 4.0, 5.0, 5.0, 5.0, 5.0, 5.0, 6.0, 6.0, 6.0, 6.0, 6.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 6.0, 6.0, 6.0, 6.0, 6.0, 4.0, 4.0, 4.0, 4.0, 4.0, 8.0, 8.0, 8.0, 8.0, 8.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 8.0, 8.0, 8.0, 8.0, 8.0, 5.0, 5.0, 5.0, 5.0, 5.0, 3.0, 3.0, 3.0, 3.0, 3.0, 2.0, 2.0, 2.0, 2.0, 2.0, 8.0, 8.0, 8.0, 8.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 6.0, 6.0, 6.0, 6.0, 6.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 4.0, 4.0, 4.0, 4.0, 4.0, 8.0, 8.0, 8.0, 8.0, 8.0, 6.0, 6.0, 6.0, 6.0, 6.0, 4.0, 4.0, 4.0, 4.0, 4.0, 5.0, 5.0, 5.0, 5.0, 5.0, 8.0, 8.0, 8.0, 8.0, 8.0, 5.0, 5.0, 5.0, 5.0, 5.0, 3.0, 3.0, 3.0]

ggerganov

Since my change doesn't help much on M2, I changed it to be off by default on that platform.

Apart from the dip at BS=8, on my machine it does help - at BS=512 the GEMM in this PR is almost 2x faster. This is with LLAMA_NO_ACCELERATE=1 though which disables the Apple's CBLAS implementation from the Accelerate framework - for large BS this remains more efficient. Anyway, we can refine in the future

Regarding GGML_USE_LLAMAFILE - as it is, when I upstream the changes to the ggml repo, the build will fail because there is no sgemm.cpp there. My idea was in the llama.cpp Makefile and CMake to define GGML_USE_LLAMAFILE=1 by default (unless LLAMA_NO_LLAMAFILE is set). I can of course add GGML_USE_LLAMAFILE=0 in the ggml repo, but it's better to have this as the default for now

jart · 2024-04-16T18:34:27Z

I vaguely recall when I was working in an experimental branch, the 8x3 kernel https://twitter.com/JustineTunney/status/1776440470152867930 would make GGML go faster than Accelerate. I've been reluctant to cause too much churn here in the interest of getting this PR in. Is there anything specific you need me to change on my end before this can be merged?

ggerganov · 2024-04-16T18:59:52Z

the 8x3 kernel twitter.com/JustineTunney/status/1776440470152867930 would make GGML go faster than Accelerate

I don't think the RPi5 uses the Accelerate framework. AFAIK it's available on Apple devices and the SGEMM that comes with it runs on some sort of specialized AMX coprocessor available in Apple Silicon, which brings extra performance to the table.

This change upstreams llamafile's cpu matrix multiplication kernels which improve image and prompt evaluation speed. For starters, Q4_0 and Q8_0 weights should go ~40% faster on CPU. The biggest benefits are with data types like f16 / f32, which process prompts 2x faster thus making them faster than quantized data types for prompt evals. This change also introduces bona fide AVX512 support since tinyBLAS is able to exploit the larger register file. For example, on my CPU llama.cpp llava-cli processes an image prompt at 305 tokens/second, using the Q4_K and Q4_0 types, which has always been faster than if we used f16 LLaVA weights, which at HEAD go 188 tokens/second. With this change, f16 LLaVA performance leap frogs to 464 tokens/second. On Intel Core i9-14900K this change improves F16 prompt perf by 5x. For example, using llama.cpp at HEAD with Mistral 7b f16 to process a 215 token prompt will go 13 tok/sec. This change has fixes making it go 52 tok/sec. It's mostly thanks to my vectorized outer product kernels but also because I added support for correctly counting the number of cores on Alderlake, so the default thread count discounts Intel's new efficiency cores. Only Linux right now can count cores. This work was sponsored by Mozilla who's given permission to change the license of this code from Apache 2.0 to MIT. To read more about what's improved, and how it works, see: https://justine.lol/matmul/

jart force-pushed the sgemm branch from 3b514d4 to 26d3614 Compare April 1, 2024 07:35

jart force-pushed the sgemm branch from 26d3614 to e867339 Compare April 1, 2024 07:52

phymbert mentioned this pull request Apr 1, 2024

server: bench: continuous performance testing #6233

Closed

16 tasks

jart force-pushed the sgemm branch from e867339 to 20b2d6c Compare April 1, 2024 08:14

jart force-pushed the sgemm branch from 20b2d6c to 08f10ec Compare April 1, 2024 10:21

jart force-pushed the sgemm branch from 08f10ec to 7fb769f Compare April 1, 2024 13:55

jart force-pushed the sgemm branch from 7fb769f to bb6ebca Compare April 2, 2024 11:10

tstanisl reviewed Apr 2, 2024

View reviewed changes

ggml.c Outdated

return;

}

UseGgmlGemm2:

(void)0;

Copy link

tstanisl Apr 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

jart force-pushed the sgemm branch from bb6ebca to 83a2f14 Compare April 2, 2024 11:59

jart mentioned this pull request Apr 2, 2024

Introduce bfloat16 support #6412

Merged

ggerganov mentioned this pull request Apr 8, 2024

add loongarch lsx and lasx optimize code #6454

Merged

ggerganov reviewed Apr 10, 2024

View reviewed changes

liqunfu reviewed Apr 11, 2024

View reviewed changes

sgemm.cpp Outdated Show resolved Hide resolved

jart force-pushed the sgemm branch from 8dbe582 to 492b76d Compare April 11, 2024 07:27

ggerganov reviewed Apr 11, 2024

View reviewed changes

jart force-pushed the sgemm branch 2 times, most recently from 79705b2 to 2b83bf5 Compare April 16, 2024 02:45

jart force-pushed the sgemm branch from 2b83bf5 to 183c4bb Compare April 16, 2024 02:46

ggerganov approved these changes Apr 16, 2024

View reviewed changes

ggerganov merged commit 8cc91dc into ggerganov:master Apr 16, 2024
62 checks passed

ggerganov mentioned this pull request Apr 16, 2024

ggml : fix llamafile sgemm wdata offsets #6710

Merged

ggerganov mentioned this pull request Apr 20, 2024

common : try to fix Android CI #6780

Merged

netrunnereve mentioned this pull request Apr 25, 2024

AVX Q4_0 and Q8_0 sgemm #6891

Merged

jart mentioned this pull request Apr 26, 2024

Question Mozilla-Ocho/llamafile#363

Closed

yirongjie mentioned this pull request Jul 30, 2024

perf: add AArch64 GEMM/GEMV for q4_0. UbiquitousLearning/mllm#104

Merged

		-e 's/src\/sgemm\.cpp/sgemm.cpp/g' \
		-e 's/src\/sgemm\.h/sgemm.h/g' \

Improve cpu prompt eval speed #6414

Improve cpu prompt eval speed #6414

Conversation

jart commented Apr 1, 2024

phymbert commented Apr 1, 2024

github-actions bot commented Apr 1, 2024 • edited Loading

JohannesGaessler commented Apr 1, 2024

kalomaze commented Apr 1, 2024

JohannesGaessler commented Apr 1, 2024

kalomaze commented Apr 1, 2024 • edited Loading

JohannesGaessler commented Apr 1, 2024

zougloub commented Apr 1, 2024

jart commented Apr 1, 2024

phymbert commented Apr 1, 2024

jart commented Apr 1, 2024

ggerganov commented Apr 1, 2024

jart commented Apr 1, 2024

netrunnereve commented Apr 2, 2024

JohannesGaessler commented Apr 2, 2024

JeremyGe07 commented Apr 2, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tstanisl Apr 3, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sorasoras commented Apr 2, 2024 • edited Loading

Jipok commented Apr 2, 2024 • edited Loading

phymbert commented Apr 2, 2024

netrunnereve commented Apr 5, 2024 • edited Loading

moshemalawach commented Apr 7, 2024

ZelinMa557 commented Apr 8, 2024

lin72h commented Apr 8, 2024

ggerganov left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jart commented Apr 11, 2024

jart commented Apr 11, 2024

jart commented Apr 11, 2024

ggerganov left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Djip007 commented Apr 11, 2024 • edited Loading

jart commented Apr 16, 2024

github-actions bot commented Apr 16, 2024

ggerganov left a comment

Choose a reason for hiding this comment

jart commented Apr 16, 2024

ggerganov commented Apr 16, 2024

github-actions bot commented Apr 1, 2024 •

edited

Loading

kalomaze commented Apr 1, 2024 •

edited

Loading

tstanisl Apr 3, 2024 •

edited

Loading

sorasoras commented Apr 2, 2024 •

edited

Loading

Jipok commented Apr 2, 2024 •

edited

Loading

netrunnereve commented Apr 5, 2024 •

edited

Loading

ggerganov left a comment •

edited

Loading

Djip007 commented Apr 11, 2024 •

edited

Loading