[SYCL] offload op #6217

Merged 5 commits into master on Mar 24, 2024
Conversation

airMeng (Collaborator) commented Mar 22, 2024

According to #5277 (reply in thread), this PR does the following:

  1. Leaves scheduling entirely to ggml_backend_sched.
  2. Removes all non-USM code along the way, since SYCL doesn't support registering host memory and recommends using USM instead (see the sketch after this list).
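
A minimal sketch of what "USM instead of registered host memory" means, assuming a plain SYCL 2020 queue; the buffer names here are illustrative and are not the actual ggml-sycl symbols:

```cpp
#include <sycl/sycl.hpp>
#include <cstring>

int main() {
    sycl::queue q{sycl::gpu_selector_v};
    const size_t n_bytes = 1024;

    // SYCL has no API to register (pin) an existing host buffer, so instead
    // we allocate USM host memory, which the runtime can transfer efficiently.
    float *host_buf = static_cast<float *>(sycl::malloc_host(n_bytes, q));

    // Device-side storage is plain USM device memory.
    float *dev_buf = static_cast<float *>(sycl::malloc_device(n_bytes, q));

    std::memset(host_buf, 0, n_bytes);
    q.memcpy(dev_buf, host_buf, n_bytes).wait();

    sycl::free(dev_buf, q);
    sycl::free(host_buf, q);
    return 0;
}
```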

results:

$:~/llama.cpp/build$ ./bin/llama-bench -n 0 -ngl 0 -m ~/llama-2-7b.Q4_0.gguf -mmp 0
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: yes
found 4 SYCL devices:
|  |                  |                                             |Compute   |Max compute|Max work|Max sub|               |
|ID|       Device Type|                                         Name|capability|units      |group   |group  |Global mem size|
|--|------------------|---------------------------------------------|----------|-----------|--------|-------|---------------|
| 0|[level_zero:gpu:0]|              Intel(R) Arc(TM) A730M Graphics|       1.3|        384|    1024|     32|    12160962560|
| 1|    [opencl:gpu:0]|              Intel(R) Arc(TM) A730M Graphics|       3.0|        384|    1024|     32|    12160962560|
| 2|    [opencl:cpu:0]|     Intel(R) Core(TM) i5-9600K CPU @ 3.70GHz|       3.0|          6|    8192|     64|    16498610176|
| 3|    [opencl:acc:0]|               Intel(R) FPGA Emulation Device|       1.2|          6|67108864|     64|    16498610176|
| model                          |       size |     params | backend    | ngl |       mmap | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | ---------- | ---------------: |
ggml_backend_sycl_set_mul_device_mode: true
detect 1 SYCL GPUs: [0] with top Max compute units:384
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |   0 |          0 | pp 512     |    341.90 ± 0.31 |

build: 4b9f3b43 (2493)
$:~/llama.cpp/build$ ./bin/llama-bench -n 0 -ngl 33 -m ~/llama-2-7b.Q4_0.gguf -mmp 0
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: yes
found 4 SYCL devices:
|  |                  |                                             |Compute   |Max compute|Max work|Max sub|               |
|ID|       Device Type|                                         Name|capability|units      |group   |group  |Global mem size|
|--|------------------|---------------------------------------------|----------|-----------|--------|-------|---------------|
| 0|[level_zero:gpu:0]|              Intel(R) Arc(TM) A730M Graphics|       1.3|        384|    1024|     32|    12160962560|
| 1|    [opencl:gpu:0]|              Intel(R) Arc(TM) A730M Graphics|       3.0|        384|    1024|     32|    12160962560|
| 2|    [opencl:cpu:0]|     Intel(R) Core(TM) i5-9600K CPU @ 3.70GHz|       3.0|          6|    8192|     64|    16498610176|
| 3|    [opencl:acc:0]|               Intel(R) FPGA Emulation Device|       1.2|          6|67108864|     64|    16498610176|
| model                          |       size |     params | backend    | ngl |       mmap | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | ---------- | ---------------: |
ggml_backend_sycl_set_mul_device_mode: true
detect 1 SYCL GPUs: [0] with top Max compute units:384
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | SYCL       |  33 |          0 | pp 512     |    851.59 ± 0.78 |

build: 4b9f3b43 (2493)

airMeng (Collaborator, Author) commented Mar 23, 2024

@slaren In fact I still don't quite understand. I think you want -ngl 0 and -ngl 33 to switch smoothly without a specific device selection; please correct me if I'm wrong.

We work on this in our spare time, so responses might be slow; please bear with us.

slaren (Collaborator) commented Mar 23, 2024

I would expect that with -ngl 0, the fastest accelerator available would be added to the list of backends in llama.cpp, so that it can be used to offload the computation of large batches. It should probably be the same device that would be chosen in single-GPU mode. Performance should increase gradually as more layers are offloaded, but even with no layers offloaded it should still be significantly faster than the CPU alone (see the graphs in #6083).
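
As an illustration only (not the actual llama.cpp selection logic), picking the SYCL GPU with the most compute units, which is what the "top Max compute units" line in the log above reports, could look roughly like this:

```cpp
#include <sycl/sycl.hpp>
#include <cstdint>
#include <iostream>

int main() {
    // Enumerate all GPU devices visible to SYCL and keep the one with the
    // highest max_compute_units, as a rough proxy for "fastest accelerator".
    sycl::device best;
    uint32_t best_cu = 0;
    for (const auto &dev : sycl::device::get_devices(sycl::info::device_type::gpu)) {
        const auto cu = dev.get_info<sycl::info::device::max_compute_units>();
        if (cu > best_cu) {
            best_cu = cu;
            best = dev;
        }
    }
    std::cout << "selected: " << best.get_info<sycl::info::device::name>()
              << " with " << best_cu << " compute units\n";
    return 0;
}
```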

The changes look good. The call to ggml_init_sycl should also be removed from ggml.c, and ggml-sycl.h should not be included in ggml.c. Instead, the backend should do its initialization the first time any of its functions are called. The goal is to remove all backend code from ggml.c.
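
A minimal sketch of that kind of lazy initialization, using hypothetical helper names rather than the actual ggml-sycl entry points:

```cpp
#include <mutex>

// Hypothetical one-time backend setup; in the real backend this would do the
// work previously triggered by calling ggml_init_sycl from ggml.c.
static void ggml_sycl_do_init() {
    // ... device discovery, queue creation, etc.
}

// Every exported backend function calls this first, so initialization happens
// lazily on first use and ggml.c needs no SYCL-specific code.
static void ggml_sycl_init_once() {
    static std::once_flag flag;
    std::call_once(flag, ggml_sycl_do_init);
}

void ggml_backend_sycl_example_entry_point() {
    ggml_sycl_init_once();
    // ... actual backend work would follow here.
}
```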

airMeng (Collaborator, Author) commented Mar 24, 2024

> The changes look good. The call to ggml_init_sycl should also be removed from ggml.c, and ggml-sycl.h should not be included in ggml.c. Instead, the backend should do its initialization the first time any of its functions are called. The goal is to remove all backend code from ggml.c.

Done in 5f8a87d

abhilash1910 (Collaborator) left a comment

LGTM! Nice work on switching to USM.

airMeng merged commit ddf6568 into master on Mar 24, 2024 (58 checks passed)
airMeng deleted the sycl-offload-op branch on Mar 25, 2024
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
* remove no USM methods

* leave the schedule to ggml_backend_sched entirely
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 3, 2024
* remove no USM methods

* leave the schedule to ggml_backend_sched entirely
tybalex pushed a commit to rubra-ai/tools.cpp that referenced this pull request Apr 17, 2024
* remove no USM methods

* leave the schedule to ggml_backend_sched entirely