
[SYCL] Caching device_info in device_ext to restore TG performance #8301

Closed

Conversation

OuadiElfarouki
Collaborator

This patch introduces a caching mechanism in device_ext to persist the device_info objects within the dev_mgr singleton instance.
This removes the runtime overhead introduced by the current get_work_group_size implementation, which triggers a full device-properties query just to read the work-group size, notably for the norm and softmax operations.

Text generation performance in particular is recovered:
Nvidia A100 + 70B Q4_K : 2.0 t/s -> 10.5 t/s (~ x5)
Nvidia A100 + 13B Q4_K : 4.25 t/s -> 42.5 t/s (~ x10)
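
A minimal sketch of the caching idea, using simplified, hypothetical names (`cached_device_info`, `get_info_cached`, `m_info`); the actual `device_ext`/`dev_mgr` helpers in ggml-sycl are more involved:

```cpp
// Hypothetical sketch: query the device properties once per device_ext and
// reuse them, instead of re-querying on every get_work_group_size() call.
#include <sycl/sycl.hpp>
#include <mutex>
#include <optional>

struct cached_device_info {          // illustrative subset of device_info
    size_t max_work_group_size;
};

class device_ext : public sycl::device {
public:
    using sycl::device::device;

    size_t get_work_group_size() {
        // After the first call this is a cheap cached read, not a device query.
        return get_info_cached().max_work_group_size;
    }

private:
    const cached_device_info &get_info_cached() {
        std::call_once(m_once, [&] {
            cached_device_info info;
            info.max_work_group_size =
                get_info<sycl::info::device::max_work_group_size>();
            m_info = info;
        });
        return *m_info;
    }

    std::once_flag m_once;
    std::optional<cached_device_info> m_info;
};
```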

Pinging @joeatodd @airMeng @AidanBeltonS @abhilash1910

@github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) and SYCL (https://en.wikipedia.org/wiki/SYCL - GPU programming language) labels on Jul 4, 2024
@ngxson
Collaborator

ngxson commented Jul 4, 2024

Probably related to #8286

@OuadiElfarouki
Collaborator Author

> Probably related to #8286

Yes indeed, there has been some overlap there. Will see how we can sort this out.

@mistrjirka

> This patch introduces a caching mechanism in device_ext to persist the device_info objects within the dev_mgr singleton instance. This removes the runtime overhead introduced by the current get_work_group_size implementation, which triggers a full device-properties query just to read the work-group size, notably for the norm and softmax operations.
>
> Text generation performance in particular is recovered: Nvidia A100 + 70B Q4_K : 2.0 t/s -> 10.5 t/s (~ x5), Nvidia A100 + 13B Q4_K : 4.25 t/s -> 42.5 t/s (~ x10)
>
> Pinging @joeatodd @airMeng @AidanBeltonS @abhilash1910
>
> * [x] I have read the [contributing guidelines](https://github.com/ggerganov/llama.cpp/blob/master/CONTRIBUTING.md)
>
> * Self-reported review complexity:
>   * [ ] Low
>   * [x] Medium
>   * [ ] High

I attempted to compile and try it on my integrated Intel GPU. It compiles, but inference does not work. I ran the command
`ZES_ENABLE_SYSMAN=1 ./build/bin/llama-bench -o md -m ~/llmmodels/Lexi-Llama-3-8B-Uncensored_Q4_K_M.gguf -n 16 -p 256 -ngl 33 -sm none -mg 0` and it failed on an assert:
log.txt

@OuadiElfarouki
Collaborator Author

OuadiElfarouki commented Jul 4, 2024

Thanks for flagging that @mistrjirka, we haven't tested the recent Llama 3 on our SYCL backend yet. Is this an error you get on the master branch as well, or is it introduced by this patch specifically?

@airMeng
Collaborator

airMeng commented Jul 5, 2024

I think the solution in #8286 is more explicit; ggml_sycl_info() is designed exactly to query these specs.

@OuadiElfarouki
Collaborator Author

Hi @airMeng,
I see that the fix in #8286 tackles the issue directly, as a workaround.
Still, I don't agree that ggml_sycl_info() is meant to hold every SYCL-interface-related property whenever we happen to need one. It should rather expose the bare hardware compute/memory capabilities that help llama.cpp decide on certain execution paths, for operators that do or do not support certain hardware features (that's why the compute capability and the VRAM size are there, for example).
The wg_size, on the other hand, serves as a kernel configuration, the kernels and paths being the same, so it's more of an adaptive performance-tuning parameter.
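
To illustrate that distinction, here is a sketch (with hypothetical helper names, not the actual norm/softmax code) in which the cached work-group size only shapes the launch configuration of an otherwise unchanged kernel:

```cpp
// Hypothetical helper: the cached max work-group size only tunes the local
// range of the launch; it does not select a different kernel or code path.
#include <sycl/sycl.hpp>
#include <algorithm>
#include <cstddef>

template <typename Kernel>
void launch_row_kernel(sycl::queue &q, size_t nrows, size_t ncols,
                       size_t cached_max_wg_size, Kernel kernel) {
    // One work-group per row, capped by the device's max work-group size.
    const size_t wg = std::min(ncols, cached_max_wg_size);
    q.parallel_for(sycl::nd_range<1>(sycl::range<1>(nrows * wg),
                                     sycl::range<1>(wg)),
                   kernel);
}
```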

@NeoZhangJianyu
Collaborator

max_work_group_size is an integer value, and for a given device it is fixed.

So, in PR #8286, I use a local cache (an int array) in the static class ggml_sycl_info():

  • detect all devices' max_work_group_size and save them in the int array during startup,
  • access it by ctx.device, which is an integer too.

This is a proper solution rather than a workaround.
It's simple and fast.
We could use a similar solution for all other device info.

Because the static class instance ggml_sycl_info() is const and all accesses are read-only, there is no need for a mutex, which avoids a performance hit.

This PR uses more complex code with a mutex.
So, ... :)
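
For comparison, a minimal sketch of the #8286-style approach described above, assuming hypothetical names (`sycl_info_sketch`, `max_work_group_sizes`); the real ggml_sycl_info() structure differs:

```cpp
// Hypothetical sketch of the #8286-style cache: per-device integers filled
// once at startup, then read without any locking.
#include <sycl/sycl.hpp>
#include <vector>

struct sycl_info_sketch {
    std::vector<int> max_work_group_sizes;   // indexed by device id
};

static const sycl_info_sketch &ggml_sycl_info_sketch() {
    // Thread-safe one-time initialization; afterwards the data is read-only,
    // so callers do not need a mutex.
    static const sycl_info_sketch info = [] {
        sycl_info_sketch i;
        for (const auto &dev :
             sycl::device::get_devices(sycl::info::device_type::gpu)) {
            i.max_work_group_sizes.push_back(
                (int) dev.get_info<sycl::info::device::max_work_group_size>());
        }
        return i;
    }();
    return info;
}

// Usage in an op: int wg = ggml_sycl_info_sketch().max_work_group_sizes[device_id];
```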

@mistrjirka

> Thanks for flagging that @mistrjirka, we haven't tested the recent Llama 3 on our SYCL backend yet. Is this an error you get on the master branch as well, or is it introduced by this patch specifically?

I am sorry, I was accidentally on the wrong branch. I have now tested it on the correct branch and can confirm that it works without problems. I do not observe any performance improvement, but I suppose that is because the A100 and Iris Xe are quite different GPUs.

```
[jirka@Sidewinder llama.cpp]$ ./build/bin/llama-bench -o md -m ~/llmmodels/Lexi-Llama-3-8B-Uncensored_Q4_K_M.gguf -n 16 -p 256 -ngl 33 -sm none -mg 0
ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 4 SYCL devices:
| model                          |       size |     params | backend    | ngl |    sm |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ----: | ------------: | ---------------: |
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
found 4 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                 Intel Iris Xe Graphics|    1.3|     96|     512|   32| 15083M|            1.3.29377|
| 1|     [opencl:gpu:0]|                 Intel Iris Xe Graphics|    3.0|     96|     512|   32| 15083M|          24.17.29377|
| 2|     [opencl:cpu:0]|11th Gen Intel Core i7-1165G7 @ 2.80GHz|    3.0|      8|    8192|   64| 16547M|2023.16.10.0.17_160000|
| 3|     [opencl:acc:0]|            Intel FPGA Emulation Device|    1.2|      8|67108864|   64| 16547M|2023.16.10.0.17_160000|
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | SYCL       |  33 |  none |         pp256 |     27.08 ± 0.29 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | SYCL       |  33 |  none |          tg16 |      3.53 ± 0.05 |

build: fc677991 (3297)
[jirka@Sidewinder llama.cpp]$ git branch
* caching_device_info
  master
```

@OuadiElfarouki
Collaborator Author

Closing this, as another fix tackling the same issue was merged first: #8286
