
[SYCL] Caching device_info in device_ext to restore TG performance #8301

Closed

Conversation

OuadiElfarouki
Collaborator

This patch introduces a caching mechanism in device_ext to persist the device_info objects within the dev_mgr singleton instance.
This removes the runtime overhead introduced by the current get_work_group_size implementation, which triggers a full device-properties query just to read the work-group size, notably for the norm and softmax operations.

Text generation performance in particular is recovered:
Nvidia A100 + 70B Q4_K : 2.0 t/s -> 10.5 t/s (~ x5)
Nvidia A100 + 13B Q4_K : 4.25 t/s -> 42.5 t/s (~ x10)
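
A minimal sketch of the caching idea, using simplified, hypothetical names (`cached_device_info`, `get_info_cached`, `m_info`); the actual `device_ext`/`dev_mgr` helpers in ggml-sycl are more involved:

```cpp
// Hypothetical sketch: query the device properties once per device_ext and
// reuse them, instead of re-querying on every get_work_group_size() call.
#include <sycl/sycl.hpp>
#include <mutex>
#include <optional>

struct cached_device_info {          // illustrative subset of device_info
    size_t max_work_group_size;
};

class device_ext : public sycl::device {
public:
    using sycl::device::device;

    size_t get_work_group_size() {
        // After the first call this is a cheap cached read, not a device query.
        return get_info_cached().max_work_group_size;
    }

private:
    const cached_device_info &get_info_cached() {
        std::call_once(m_once, [&] {
            cached_device_info info;
            info.max_work_group_size =
                get_info<sycl::info::device::max_work_group_size>();
            m_info = info;
        });
        return *m_info;
    }

    std::once_flag m_once;
    std::optional<cached_device_info> m_info;
};
```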

Pinging @joeatodd @airMeng @AidanBeltonS @abhilash1910

@github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) and SYCL (https://en.wikipedia.org/wiki/SYCL - GPU programming language) labels on Jul 4, 2024
@ngxson
Collaborator

ngxson commented Jul 4, 2024

Probably related to #8286

@OuadiElfarouki
Collaborator Author

> Probably related to #8286

Yes indeed, there has been some overlap there. Will see how we can sort this out.

@mistrjirka

> This patch introduces a caching mechanism in device_ext to persist the device_info objects within the dev_mgr singleton instance. This removes the runtime overhead introduced by the current get_work_group_size implementation, which triggers a full device-properties query just to read the work-group size, notably for the norm and softmax operations.
>
> Text generation performance in particular is recovered: Nvidia A100 + 70B Q4_K : 2.0 t/s -> 10.5 t/s (~ x5), Nvidia A100 + 13B Q4_K : 4.25 t/s -> 42.5 t/s (~ x10)
>
> Pinging @joeatodd @airMeng @AidanBeltonS @abhilash1910
>
> * [x] I have read the [contributing guidelines](https://github.com/ggerganov/llama.cpp/blob/master/CONTRIBUTING.md)
>
> * Self-reported review complexity:
>   * [ ] Low
>   * [x] Medium
>   * [ ] High

I attempted to compile and try it on my integrated Intel GPU. It compiles, but inference does not work. I ran the command
`ZES_ENABLE_SYSMAN=1 ./build/bin/llama-bench -o md -m ~/llmmodels/Lexi-Llama-3-8B-Uncensored_Q4_K_M.gguf -n 16 -p 256 -ngl 33 -sm none -mg 0` and it failed on an assert:
log.txt

@OuadiElfarouki
Collaborator Author

OuadiElfarouki commented Jul 4, 2024

Thanks for flagging that @mistrjirka, we haven't tested the recent Llama 3 on our SYCL backend yet. Is this an error you get on the master branch as well, or is it introduced by this patch specifically?

@airMeng
Collaborator

airMeng commented Jul 5, 2024

I think the solution in #8286 is more explicit; ggml_sycl_info() is designed exactly to query these specs.

@OuadiElfarouki
Collaborator Author

Hi @airMeng,
I see that the fix in #8286 tackles the issue directly, as a workaround.
Still, I don't agree that ggml_sycl_info() is meant to hold every SYCL-interface-related property whenever we happen to need one. It should rather expose the bare hardware compute/memory capabilities that help llama.cpp decide on certain execution paths, for operators that do or do not support certain hardware features (that's why the compute capability and the VRAM size are there, for example).
The wg_size, on the other hand, serves as a kernel configuration, the kernels and paths being the same, so it's more of an adaptive performance-tuning parameter.
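
To illustrate that distinction, here is a sketch (with hypothetical helper names, not the actual norm/softmax code) in which the cached work-group size only shapes the launch configuration of an otherwise unchanged kernel:

```cpp
// Hypothetical helper: the cached max work-group size only tunes the local
// range of the launch; it does not select a different kernel or code path.
#include <sycl/sycl.hpp>
#include <algorithm>
#include <cstddef>

template <typename Kernel>
void launch_row_kernel(sycl::queue &q, size_t nrows, size_t ncols,
                       size_t cached_max_wg_size, Kernel kernel) {
    // One work-group per row, capped by the device's max work-group size.
    const size_t wg = std::min(ncols, cached_max_wg_size);
    q.parallel_for(sycl::nd_range<1>(sycl::range<1>(nrows * wg),
                                     sycl::range<1>(wg)),
                   kernel);
}
```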

@NeoZhangJianyu
Collaborator

max_work_group_size is an integer value, and for a given device it is fixed.

So, in PR #8286, I use a local cache (an int array) in the static class ggml_sycl_info():

  • detect all devices' max_work_group_size and save them in the int array during startup,
  • access it by ctx.device, which is an integer too.

This is a proper solution rather than a workaround.
It's simple and fast.
We could use a similar solution for all other device info.

Because the static class instance ggml_sycl_info() is const and all accesses are read-only, there is no need for a mutex, which avoids a performance hit.

This PR uses more complex code with a mutex.
So, ... :)
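
For comparison, a minimal sketch of the #8286-style approach described above, assuming hypothetical names (`sycl_info_sketch`, `max_work_group_sizes`); the real ggml_sycl_info() structure differs:

```cpp
// Hypothetical sketch of the #8286-style cache: per-device integers filled
// once at startup, then read without any locking.
#include <sycl/sycl.hpp>
#include <vector>

struct sycl_info_sketch {
    std::vector<int> max_work_group_sizes;   // indexed by device id
};

static const sycl_info_sketch &ggml_sycl_info_sketch() {
    // Thread-safe one-time initialization; afterwards the data is read-only,
    // so callers do not need a mutex.
    static const sycl_info_sketch info = [] {
        sycl_info_sketch i;
        for (const auto &dev :
             sycl::device::get_devices(sycl::info::device_type::gpu)) {
            i.max_work_group_sizes.push_back(
                (int) dev.get_info<sycl::info::device::max_work_group_size>());
        }
        return i;
    }();
    return info;
}

// Usage in an op: int wg = ggml_sycl_info_sketch().max_work_group_sizes[device_id];
```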

@mistrjirka

> Thanks for flagging that @mistrjirka, we haven't tested the recent Llama 3 on our SYCL backend yet. Is this an error you get on the master branch as well, or is it introduced by this patch specifically?

I am sorry, I was accidentally on the wrong branch. I have now tested it on the correct branch and can confirm that it works without problems. I do not observe any performance improvement, but I suppose that is because the A100 and Iris Xe are quite different GPUs.

```
[jirka@Sidewinder llama.cpp]$ ./build/bin/llama-bench -o md -m ~/llmmodels/Lexi-Llama-3-8B-Uncensored_Q4_K_M.gguf -n 16 -p 256 -ngl 33 -sm none -mg 0
ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 4 SYCL devices:
| model                          |       size |     params | backend    | ngl |    sm |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ----: | ------------: | ---------------: |
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
found 4 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                 Intel Iris Xe Graphics|    1.3|     96|     512|   32| 15083M|            1.3.29377|
| 1|     [opencl:gpu:0]|                 Intel Iris Xe Graphics|    3.0|     96|     512|   32| 15083M|          24.17.29377|
| 2|     [opencl:cpu:0]|11th Gen Intel Core i7-1165G7 @ 2.80GHz|    3.0|      8|    8192|   64| 16547M|2023.16.10.0.17_160000|
| 3|     [opencl:acc:0]|            Intel FPGA Emulation Device|    1.2|      8|67108864|   64| 16547M|2023.16.10.0.17_160000|
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | SYCL       |  33 |  none |         pp256 |     27.08 ± 0.29 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | SYCL       |  33 |  none |          tg16 |      3.53 ± 0.05 |

build: fc677991 (3297)
[jirka@Sidewinder llama.cpp]$ git branch
* caching_device_info
  master
```

@OuadiElfarouki
Collaborator Author

Closing this, as another fix tackling the same issue was merged first: #8286
