Prospective Project Idea for GSoC 2024: Integration with llama.cpp and ggml #22382
-
Thanks @AlexFierro9. @mlukasze, who from our developers would be interested in that kind of project?
-
@peterchen-intel, could you take a look, please?
-
Greetings, @AlexFierro9! llama.cpp integration sounds interesting - could you please outline your idea in some more detail? I gather that the SYCL implementation in the PR you linked would be serving as the compute backend. Do you propose that OpenVINO would have to provide the frontend functionality? If so, do you already have any ideas about the manner in which OpenVINO would be serving as a frontend?
-
Hi @vshampor! GGUF files are gaining traction in the community, mostly because they allow one to run LLMs of up to 70B parameters in as little as 32 GB of RAM, and because the inference is written in C++ it is inherently fast. Since the GGUF file structure is not compatible with OpenVINO's IR format, I think it might be possible to add support for these files separately, and since llama.cpp officially supports SYCL it is possible to offload the whole model to the GPU itself. What I propose is that we create bindings from llama.cpp to OpenVINO and give the user three options:

1.) Normal backend with no optimizations

The front end will look something like the sketch below: the prompt call goes to the main backend, and the response is stored there.

As far as implementing the backend is concerned, there are already Python bindings for llama.cpp, but they do not allow for much customization when setting up the backend. I can elaborate more on this as needed, but I think this explains the core idea of my proposal. Thoughts?
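A minimal sketch of what that front end could look like, assuming a hypothetical wrapper class over llama.cpp (all class and method names below are illustrative, not an existing OpenVINO or llama.cpp API):

```cpp
#include <iostream>
#include <string>

// Hypothetical front-end sketch: a thin wrapper that forwards prompts to a
// llama.cpp backend. Names are placeholders, not an existing API.
class LlamaBackend {
public:
    LlamaBackend(const std::string& gguf_path, const std::string& device)
        : gguf_path_(gguf_path), device_(device) {
        // Here the GGUF file would be loaded via llama.cpp bindings and the
        // requested device ("CPU", "GPU" via SYCL, ...) selected.
    }

    // Prompt will call on the main backend and store the response there.
    std::string prompt(const std::string& user_prompt) {
        // A real implementation would tokenize, run llama.cpp inference and
        // detokenize; this stub only records the call.
        last_response_ = "response to: " + user_prompt;
        return last_response_;
    }

private:
    std::string gguf_path_;
    std::string device_;
    std::string last_response_;  // response stored after each prompt call
};

int main() {
    LlamaBackend backend("model.gguf", "GPU");
    std::cout << backend.prompt("Explain SYCL in one sentence.") << "\n";
    return 0;
}
```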
-
This idea is very good; can I work on it with you?
-
@vshampor, thoughts?
-
What they did in the PR you linked was basically to implement the ggml backend interface on top of SYCL. However, AFAIK OpenVINO does not provide an interface to the execution primitives of its plugins (either CPU, GPU, or the rest). In fact, I think that the OpenVINO plugins themselves reuse the same MKL/SYCL primitives. So if you expected to do the same thing as in the PR with OpenVINO, it wouldn't even be technically feasible. OV is rather targeted at executing holistic models with a known execution graph, achieving performance in part due to optimizations across the execution graph nodes.

I suppose that one could imagine building, instantiating and executing single-primitive models via OpenVINO for each ggml operation. Another hypothetical option would be to execute the entire graph with OpenVINO by creating an ov::Model from it.
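To illustrate the first hypothetical option, here is a rough sketch of building and executing a single-primitive model (one MatMul) with the standard OpenVINO 2.x C++ API; the shapes and the choice of MatMul are made up for illustration, and doing this per ggml operation would forgo the cross-node graph optimizations mentioned above:

```cpp
#include <openvino/openvino.hpp>
#include <openvino/opsets/opset8.hpp>

#include <algorithm>
#include <memory>
#include <vector>

int main() {
    ov::Core core;

    // A "model" consisting of a single MatMul primitive (placeholder shapes).
    auto input = std::make_shared<ov::opset8::Parameter>(ov::element::f32, ov::Shape{1, 8});
    auto weights = ov::opset8::Constant::create(ov::element::f32, ov::Shape{8, 8},
                                                std::vector<float>(64, 0.1f));
    auto matmul = std::make_shared<ov::opset8::MatMul>(input, weights, false, false);
    auto model = std::make_shared<ov::Model>(ov::OutputVector{matmul},
                                             ov::ParameterVector{input});

    // Compile and run the single primitive on a chosen device.
    ov::CompiledModel compiled = core.compile_model(model, "CPU");
    ov::InferRequest request = compiled.create_infer_request();

    ov::Tensor in_tensor(ov::element::f32, ov::Shape{1, 8});
    std::fill_n(in_tensor.data<float>(), 8, 1.0f);
    request.set_input_tensor(in_tensor);
    request.infer();

    ov::Tensor out = request.get_output_tensor();  // 1x8 result of the MatMul
    return 0;
}
```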
-
I agree with your point.
Perhaps we can look into building a wrapper in OpenVINO for this? Each time the user installs OpenVINO, it would automatically also build llama.cpp with the three backend options, and each time the wrapper is called it would start the llama.cpp build installed earlier, capture the output, and then present/stream it to the user. I have personally used OpenVINO and llama.cpp together on a few projects, and getting llama.cpp compiled in the first place has been a lot of trouble so far (even with llama-cpp-python); almost every time I wished that OpenVINO supported it in some way. This way we would allow OpenVINO users to use another popular format without needing to work with other libraries, which could make development a lot more streamlined.
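As a rough sketch (under the assumption that a llama.cpp binary has already been built at a known location; the binary path and flags below are placeholders, not an existing OpenVINO feature), such a wrapper could simply launch the prebuilt binary and stream its output:

```cpp
#include <cstdio>
#include <iostream>
#include <stdexcept>
#include <string>

// Launch a previously built llama.cpp binary and stream its stdout back to
// the caller. The binary location and flags are placeholders for illustration.
std::string run_llama(const std::string& binary, const std::string& gguf_path,
                      const std::string& prompt) {
    const std::string cmd = binary + " -m " + gguf_path + " -p \"" + prompt + "\"";
    FILE* pipe = popen(cmd.c_str(), "r");
    if (!pipe) {
        throw std::runtime_error("failed to start the llama.cpp process");
    }

    std::string output;
    char buffer[256];
    while (fgets(buffer, sizeof(buffer), pipe)) {
        std::cout << buffer;  // stream tokens to the user as they arrive
        output += buffer;     // keep the full response for the caller
    }
    pclose(pipe);
    return output;
}

int main() {
    run_llama("./llama.cpp/main", "model.gguf", "Hello!");
    return 0;
}
```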
-
Ya sure, it's my pleasure.
-
As part of an ideas list for GSoC 2024, I would like to propose an integration with the ggml library used in llama.cpp. It allows for simultaneous CPU + GPU inference, and a SYCL-based backend is already in the final stages. Using this would allow developers to utilize both the CPU and the GPU concurrently for high-demand LLMs.
I propose that we integrate the SYCL backend together with a fully configured front end.
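For reference, a minimal sketch of how llama.cpp exposes that CPU + GPU split through its C API (the layer count and file name are placeholders, and the exact function signatures have changed between llama.cpp versions, so treat this as an approximation):

```cpp
#include "llama.h"

#include <cstdio>

int main() {
    // Initialize llama.cpp (older versions took a NUMA flag here).
    llama_backend_init();

    // Offload part of the network to the GPU build (SYCL, CUDA, Metal, ...)
    // and keep the remaining layers on the CPU -- the CPU + GPU split.
    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 32;  // placeholder: number of layers to offload

    llama_model* model = llama_load_model_from_file("model.gguf", mparams);
    if (!model) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    llama_context_params cparams = llama_context_default_params();
    llama_context* ctx = llama_new_context_with_model(model, cparams);

    // ... tokenization and llama_decode() calls would go here ...

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```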