Prospective Project Idea for GSoC 2024: Integration with llama.cpp and ggml #22382
-
Thanks @AlexFierro9. @mlukasze, who from our developers would be interested in that kind of project?
-
@peterchen-intel, could you take a look, please?
-
Greetings, @AlexFierro9! llama.cpp integration sounds interesting - could you please outline your idea in some more detail? I gather that the SYCL implementation in the PR you linked would be serving as the compute backend. Do you propose that OpenVINO would have to provide the frontend functionality? If so, do you already have any ideas about the manner in which OpenVINO would be serving as a frontend?
-
Hi @vshampor! GGUF files are gaining traction in the community, mostly because they allow one to run LLMs of up to 70B parameters in as little as 32 GB of RAM, and because the inference is written in C++ it is inherently fast. Since the GGUF file structure is not compatible with OpenVINO's IR format, I think it might be possible to add support for these files separately, and since llama.cpp officially supports SYCL it is possible to offload the whole model to the GPU itself. What I propose is that we create bindings from llama.cpp to OpenVINO and give the user three options:

1.) Normal backend with no optimizations

The front end will look something like the sketch below: the prompt call goes to the main backend, and the response is stored there.

As far as implementing the backend is concerned, there are already Python bindings for llama.cpp, but they do not allow for much customization when setting up the backend. I can elaborate more on this as needed, but I think this explains the core idea of my proposal. Thoughts?
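A minimal sketch of what that front end could look like, assuming a hypothetical wrapper class over llama.cpp (all class and method names below are illustrative, not an existing OpenVINO or llama.cpp API):

```cpp
#include <iostream>
#include <string>

// Hypothetical front-end sketch: a thin wrapper that forwards prompts to a
// llama.cpp backend. Names are placeholders, not an existing API.
class LlamaBackend {
public:
    LlamaBackend(const std::string& gguf_path, const std::string& device)
        : gguf_path_(gguf_path), device_(device) {
        // Here the GGUF file would be loaded via llama.cpp bindings and the
        // requested device ("CPU", "GPU" via SYCL, ...) selected.
    }

    // Prompt will call on the main backend and store the response there.
    std::string prompt(const std::string& user_prompt) {
        // A real implementation would tokenize, run llama.cpp inference and
        // detokenize; this stub only records the call.
        last_response_ = "response to: " + user_prompt;
        return last_response_;
    }

private:
    std::string gguf_path_;
    std::string device_;
    std::string last_response_;  // response stored after each prompt call
};

int main() {
    LlamaBackend backend("model.gguf", "GPU");
    std::cout << backend.prompt("Explain SYCL in one sentence.") << "\n";
    return 0;
}
```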
-
This idea is very good; can I work on it with you?
-
@vshampor, thoughts?
-
What they did in the PR you linked was basically to implement the ggml backend interface on top of SYCL. However, AFAIK OpenVINO does not provide an interface to the execution primitives of its plugins (either CPU, GPU, or the rest). In fact, I think that the OpenVINO plugins themselves reuse the same MKL/SYCL primitives. So if you expected to do the same thing as in the PR with OpenVINO, it wouldn't even be technically feasible. OV is rather targeted at executing holistic models with a known execution graph, achieving performance in part due to optimizations across the execution graph nodes.

I suppose that one could imagine building, instantiating and executing single-primitive models via OpenVINO for each ggml operation. Another hypothetical option would be to execute the entire graph with OpenVINO by creating an ov::Model from it.
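To illustrate the first hypothetical option, here is a rough sketch of building and executing a single-primitive model (one MatMul) with the standard OpenVINO 2.x C++ API; the shapes and the choice of MatMul are made up for illustration, and doing this per ggml operation would forgo the cross-node graph optimizations mentioned above:

```cpp
#include <openvino/openvino.hpp>
#include <openvino/opsets/opset8.hpp>

#include <algorithm>
#include <memory>
#include <vector>

int main() {
    ov::Core core;

    // A "model" consisting of a single MatMul primitive (placeholder shapes).
    auto input = std::make_shared<ov::opset8::Parameter>(ov::element::f32, ov::Shape{1, 8});
    auto weights = ov::opset8::Constant::create(ov::element::f32, ov::Shape{8, 8},
                                                std::vector<float>(64, 0.1f));
    auto matmul = std::make_shared<ov::opset8::MatMul>(input, weights, false, false);
    auto model = std::make_shared<ov::Model>(ov::OutputVector{matmul},
                                             ov::ParameterVector{input});

    // Compile and run the single primitive on a chosen device.
    ov::CompiledModel compiled = core.compile_model(model, "CPU");
    ov::InferRequest request = compiled.create_infer_request();

    ov::Tensor in_tensor(ov::element::f32, ov::Shape{1, 8});
    std::fill_n(in_tensor.data<float>(), 8, 1.0f);
    request.set_input_tensor(in_tensor);
    request.infer();

    ov::Tensor out = request.get_output_tensor();  // 1x8 result of the MatMul
    return 0;
}
```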
-
I agree with your point.
Perhaps we can look into building a wrapper in OpenVINO for this? Each time the user installs OpenVINO, it would automatically also build llama.cpp with the three backend options, and each time the wrapper is called it would start the llama.cpp build installed earlier, capture the output, and then present/stream it to the user. I have personally used OpenVINO and llama.cpp together on a few projects, and getting llama.cpp compiled in the first place has been a lot of trouble so far (even with llama-cpp-python); almost every time I wished that OpenVINO supported it in some way. This way we would allow OpenVINO users to use another popular format without needing to work with other libraries, which could make development a lot more streamlined.
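As a rough sketch (under the assumption that a llama.cpp binary has already been built at a known location; the binary path and flags below are placeholders, not an existing OpenVINO feature), such a wrapper could simply launch the prebuilt binary and stream its output:

```cpp
#include <cstdio>
#include <iostream>
#include <stdexcept>
#include <string>

// Launch a previously built llama.cpp binary and stream its stdout back to
// the caller. The binary location and flags are placeholders for illustration.
std::string run_llama(const std::string& binary, const std::string& gguf_path,
                      const std::string& prompt) {
    const std::string cmd = binary + " -m " + gguf_path + " -p \"" + prompt + "\"";
    FILE* pipe = popen(cmd.c_str(), "r");
    if (!pipe) {
        throw std::runtime_error("failed to start the llama.cpp process");
    }

    std::string output;
    char buffer[256];
    while (fgets(buffer, sizeof(buffer), pipe)) {
        std::cout << buffer;  // stream tokens to the user as they arrive
        output += buffer;     // keep the full response for the caller
    }
    pclose(pipe);
    return output;
}

int main() {
    run_llama("./llama.cpp/main", "model.gguf", "Hello!");
    return 0;
}
```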
-
Ya sure, it's my pleasure.
-
As part of an ideas list for GSoC 2024, I would like to propose an integration with the ggml library used in llama.cpp. It allows for simultaneous CPU + GPU inference, and a SYCL-based backend is already in the final stages. Using this would allow developers to utilize both the CPU and the GPU concurrently for high-demand LLMs.
I propose that we integrate the SYCL backend together with a fully configured front end.
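For reference, a minimal sketch of how llama.cpp exposes that CPU + GPU split through its C API (the layer count and file name are placeholders, and the exact function signatures have changed between llama.cpp versions, so treat this as an approximation):

```cpp
#include "llama.h"

#include <cstdio>

int main() {
    // Initialize llama.cpp (older versions took a NUMA flag here).
    llama_backend_init();

    // Offload part of the network to the GPU build (SYCL, CUDA, Metal, ...)
    // and keep the remaining layers on the CPU -- the CPU + GPU split.
    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 32;  // placeholder: number of layers to offload

    llama_model* model = llama_load_model_from_file("model.gguf", mparams);
    if (!model) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    llama_context_params cparams = llama_context_default_params();
    llama_context* ctx = llama_new_context_with_model(model, cparams);

    // ... tokenization and llama_decode() calls would go here ...

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```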