Towards a C++ library #36
Comments
I 100% agree with this. This is also what I intend for this project to be. Loading the model once should be enough, and there should be an option to keep the key & value cache between calls to avoid re-computation in a multi-turn (chat-style) mode.
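As a rough sketch of what keeping the key & value cache across turns could mean, assuming nothing about the actual cformers/ggml internals (the struct and field names below are purely illustrative): each layer keeps one K and one V buffer, a counter records how many tokens have already been evaluated, and each new turn only runs the forward pass on its own new tokens.

```cpp
#include <cstddef>
#include <vector>

// Illustrative sketch only, not the real cformers/ggml data structures:
// a per-session key/value cache that survives between prompts so a
// multi-turn chat never re-evaluates the tokens of earlier turns.
struct KVCache {
    int n_layers;
    int n_ctx;        // maximum context length in tokens
    int n_embd;       // embedding size
    int n_cached = 0; // tokens already evaluated in previous turns

    // one K and one V buffer per layer, each n_ctx * n_embd floats
    std::vector<std::vector<float>> k, v;

    KVCache(int layers, int ctx, int embd)
        : n_layers(layers), n_ctx(ctx), n_embd(embd),
          k(layers, std::vector<float>(static_cast<std::size_t>(ctx) * embd)),
          v(layers, std::vector<float>(static_cast<std::size_t>(ctx) * embd)) {}
};

// A hypothetical eval(model, cache, new_tokens) would write K/V entries for
// positions [cache.n_cached, cache.n_cached + new_tokens.size()) and then
// advance cache.n_cached, instead of re-running the whole conversation.
```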
A lot has happened on the llama.cpp repo:
Currently, in the cformers repo there is only one Makefile for the build, which is only supported on POSIX systems. We could add a CMakeLists like in the llama.cpp repo, but those build files have to be maintained for each OS, which I find not very practical. I have used XMake for some time; it's an alternative to CMake with Lua scripting. Small example (already working with the cformers code):

```lua
add_rules("mode.debug", "mode.release")
set_languages("cxx11", "c11")

target("cformers")
    set_kind("$(kind)")
    set_default(true)
    add_files("src/**.cpp")
    add_files("src/**.c")
    if is_plat("linux") then
        add_syslinks("pthread")
        add_cflags("-D_POSIX_C_SOURCE=199309L")
    end
    add_headerfiles("include/**.h")
    add_includedirs("include", {public = true})

target("quantize_bloom")
    set_kind("binary")
    add_files("quantize/quantize_bloom.cpp")
    add_deps("cformers")
```

The quantization program could then be run directly through xmake. It has another advantage: it can use packages (700+ on xrepo). I noticed that llama.cpp supports OpenBLAS, so with xmake it could look like this:

```lua
add_requires("openblas")

target("ggml")
    set_kind("static")
    add_packages("openblas")
```

I know that it can be difficult to start with a new tool, but I feel it's easier to get started with than CMake, and it's really a pleasure to work with. I also have a setup script of about 170 lines of code that downloads some models, converts them to their C++ version, installs Python, ... What do you think?
This project seems to use Python bindings so that the model does not have to be loaded into memory each time. Taking inspiration from the work there may be a good idea.
Development of roadmap ideas:
I'm particularly interested in this project for a C++ library. This would allow multiple projects to use the code.
I feel it is worth mentioning CTranslate2, which is a C++ library mainly for translation transformers but which also does text generation with BLOOM or OPT.
Anyway, I find the current project structure not very practical for this task, so I propose moving all the C files to `src` and `include` directories in the root folder of the repo. This would simplify the usage/compilation of the C backend.
Currently, the model is loaded every time a prompt is submitted, which slows down the process. Thus, instead of using an executable program, an API could be used along with pybind11 to improve the performance of the model.
That API might look something like this:
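As a purely hypothetical sketch (the module name `cformers_core`, the `ModelSession` class, and its methods are illustrative, not the actual cformers API), a pybind11 binding that loads the model once and keeps it resident between calls could be:

```cpp
#include <pybind11/pybind11.h>
#include <string>

namespace py = pybind11;

// Hypothetical wrapper around the C/C++ backend: the constructor loads the
// weights once, and the object also owns the key & value cache so repeated
// generate() calls skip both reloading and re-evaluating earlier turns.
class ModelSession {
public:
    explicit ModelSession(const std::string &model_path)
        : model_path_(model_path) {
        // real code would call into the ggml loader here and keep the
        // weights resident for the lifetime of the object
    }

    std::string generate(const std::string &prompt, int n_tokens) {
        // real code would tokenize the prompt, run the transformer for up
        // to n_tokens new tokens, and reuse the cached keys/values
        (void)n_tokens;
        return "<generated text for: " + prompt + ">";
    }

    void reset_cache() {
        // drop the key & value cache to start a fresh conversation
    }

private:
    std::string model_path_;
};

PYBIND11_MODULE(cformers_core, m) {
    py::class_<ModelSession>(m, "ModelSession")
        .def(py::init<const std::string &>(), py::arg("model_path"))
        .def("generate", &ModelSession::generate,
             py::arg("prompt"), py::arg("n_tokens") = 128)
        .def("reset_cache", &ModelSession::reset_cache);
}
```

From Python, this would be used as `session = cformers_core.ModelSession("model.bin")` followed by repeated `session.generate(...)` calls, with no reload between prompts.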