
rwkv.cpp server #17

Open · abetlen opened this issue Apr 7, 2023 · 8 comments


abetlen commented Apr 7, 2023

Hi @saharNooby, first off, amazing work on this repo! I've been looking for a CPU implementation of RWKV to experiment with the pre-trained models (I don't have a large GPU).

I've put together a basic port of my OpenAI-compatible web server from llama-cpp-python and tested it on Linux with your library and the RWKV Raven 3B model in f16, q4_0, and q4_1 (pictured below). I'm going to try some larger models this weekend to test performance and quality. The cool thing about exposing the model through this server is that it opens the project up to any OpenAI client (langchain, chat UIs, multi-language client libraries).
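For illustration, a client could then point the (pre-1.0) openai Python package at such a server. The host, port, and model name below are placeholders, not values from this thread:

```python
# Illustrative only: the host, port, and model name are placeholders.
import openai  # pip install "openai<1.0"

openai.api_base = "http://localhost:8000/v1"  # wherever the rwkv.cpp server listens
openai.api_key = "sk-no-key-required"         # local servers typically ignore the key

response = openai.ChatCompletion.create(
    model="rwkv-raven-3b",  # hypothetical model identifier served locally
    messages=[{"role": "user", "content": "Write a haiku about CPUs."}],
    max_tokens=64,
)
print(response["choices"][0]["message"]["content"])
```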

Let me know if you'd like me to open a PR to merge this in somewhere, and if so, the best place to put it.

Cheers, and again great work!

[Screenshot: the server running the RWKV Raven 3B model in f16, q4_0, and q4_1]

saharNooby (Collaborator) commented

Hi @abetlen, thanks!

Having an OpenAI-compatible inference server is indeed great and will definitely increase the usability of rwkv.cpp. I've written one for myself, but it's too crude to be open-sourced. If your server is contained in a single file and has almost no dependencies (aside from Flask), I think it can be put straight into the rwkv directory in this repo.

If the server requires more structure, then I'm not sure... Having it inside a subdirectory may not work: last time I checked, Python does not like referencing .py files in subdirectories. Maybe in that case rwkv.cpp should become a pip package, you could create a separate repo for the server, and I can link it in the README here. I'm not sure how to create a pip package though, given that we need to build a C library for it.

Please also don't forget to check out the new Q4_1_O format, which will be merged in this PR soon. Q4_0 and Q4_1, inherited from llama.cpp, heavily reduce the quality of RWKV, so a new format was needed: it preserves outlier weights better and does not quantize activations. Details are in this issue. I'll also add a simple perplexity-measuring script soon, so everyone can verify the results.
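For readers unfamiliar with the idea, here is a toy sketch of outlier-aware block quantization. It only illustrates the general principle (keep the largest-magnitude weight of each block in full precision, 4-bit-quantize the rest); it is not the actual Q4_1_O layout:

```python
# Toy illustration only -- NOT the actual Q4_1_O layout. The idea: keep the single
# largest-magnitude weight of each block in full precision and 4-bit-quantize the
# rest, so one outlier cannot blow up the quantization scale for the whole block.
import numpy as np

def quantize_block(block: np.ndarray):
    outlier_idx = int(np.argmax(np.abs(block)))
    outlier_val = float(block[outlier_idx])
    rest = np.delete(block, outlier_idx)
    lo, hi = float(rest.min()), float(rest.max())
    scale = (hi - lo) / 15 or 1.0                 # 4 bits -> 16 levels
    quantized = np.round((rest - lo) / scale).astype(np.uint8)
    return quantized, lo, scale, outlier_idx, outlier_val

def dequantize_block(quantized, lo, scale, outlier_idx, outlier_val):
    rest = quantized.astype(np.float32) * scale + lo
    return np.insert(rest, outlier_idx, outlier_val)

block = np.random.randn(32).astype(np.float32)
block[7] = 40.0                                   # simulate an outlier weight
approx = dequantize_block(*quantize_block(block))
print("max abs error:", np.abs(block - approx).max())  # the outlier is reproduced exactly
```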

I can also suggest implementing a state cache, like this. The idea is to cache states by a hash of the prompt string, so during the next inference call we don't need to go through the whole prompt again -- I can't overstate how slow long conversations with chatbots are on CPU without a cache. You decide though, the cache is an additional complication after all :)
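A minimal sketch of such a cache, assuming a hypothetical model.eval(token, state) binding (the real rwkv.cpp Python API may differ):

```python
# Sketch of a prompt-state cache; `model.eval(token, state)` is a stand-in for the
# real rwkv.cpp Python binding, which may have a different signature.
import hashlib

state_cache = {}  # sha256(prompt) -> (logits, state) after evaluating that prompt

def eval_prompt_cached(model, tokenizer, prompt: str):
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key not in state_cache:
        logits, state = None, None
        for token in tokenizer.encode(prompt):
            logits, state = model.eval(token, state)
        state_cache[key] = (logits, state)
    # Callers should copy the returned state before generating further,
    # so the cached entry is not mutated.
    return state_cache[key]
```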


abetlen commented Apr 8, 2023

Thanks for the reply @saharNooby

I'll see about putting it into a single file; right now it depends on three packages: fastapi (framework), sse_starlette (server-sent events), and uvicorn (server). And thank you for sharing that cache implementation; I'll be sure to integrate it (it's actually something I'm working on for the llama.cpp server too)!

I think a pip package would be very useful; if you need any help putting that together, I'd be happy to assist. I have one for my llama.cpp Python bindings, and the approach I took for the server is to distribute it as an optional extra (i.e. pip install llama-cpp-python[server]).

To handle the C library dependency I ended up using scikit-build, which has support for building native shared libraries. That way, when users do a pip install, it builds from source on their system, ensuring the proper optimisations are selected. Let me know and I can put together a PR to get you going in that regard, and I'd gladly share any bugfixes between the projects.
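A rough sketch of what such a setup could look like with scikit-build; the package name, paths, and extras below are assumptions, not an existing rwkv.cpp configuration:

```python
# Hypothetical setup.py using scikit-build; package name, paths, and extras are
# assumptions, not an existing rwkv.cpp configuration.
from skbuild import setup  # pip install scikit-build

setup(
    name="rwkv-cpp-python",        # hypothetical package name
    version="0.1.0",
    description="Python bindings for rwkv.cpp",
    packages=["rwkv"],
    cmake_install_dir="rwkv",      # place the built shared library next to the bindings
    extras_require={
        # optional server dependencies: pip install rwkv-cpp-python[server]
        "server": ["fastapi", "sse_starlette", "uvicorn"],
    },
)
```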

PS: Will definitely check out that new quantization format, thanks!

ss-zheng commented

Have we merged this change yet?

saharNooby (Collaborator) commented

@ss-zheng As far as I know, adding the server requires #21 to be merged, and it is not merged yet.

But there are already OpenAI-compatible REST servers that support rwkv.cpp, like https://github.com/go-skynet/LocalAI


ss-zheng commented Jun 3, 2023

Great, thanks for pointing me to it!

alienatorZ commented

I think it would be great to have a plain Python server rather than LocalAI's Docker build process.


edbarz9 commented Jul 18, 2023

(quoting @saharNooby's earlier reply above about putting a single-file Flask server into the rwkv directory)

So, I made this fork on my git server https://git.brz9.dev/ed/rwkv.cpp with this extra flask_server.py in rwkv/.

It can be run with $ python rwkv/flask_server.py --model <path_to_model> --port 5349

Then it can be queried with:

curl -X POST -H "Content-Type: application/json" -d '{"prompt":"Write a hello world program in python", "temperature":0.8, "top_p":0.2, "max_length":250}' http://127.0.0.1:5349/chat

This is still a work in progress, but it would be a small addition to the codebase.
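For reference, a minimal sketch of what such a /chat endpoint might look like; this is not the actual flask_server.py from the fork, and generate() is just a stub standing in for the rwkv.cpp sampling loop:

```python
# Minimal sketch of a /chat endpoint matching the request above; generate() is a
# stub standing in for the actual rwkv.cpp sampling loop.
from flask import Flask, jsonify, request

app = Flask(__name__)

def generate(prompt, temperature, top_p, max_length):
    # Placeholder: a real implementation would tokenize the prompt, run the
    # rwkv.cpp model, and sample up to max_length tokens.
    return f"(echo) {prompt}"

@app.route("/chat", methods=["POST"])
def chat():
    body = request.get_json()
    completion = generate(
        prompt=body["prompt"],
        temperature=body.get("temperature", 0.8),
        top_p=body.get("top_p", 0.5),
        max_length=body.get("max_length", 250),
    )
    return jsonify({"completion": completion})

if __name__ == "__main__":
    app.run(host="127.0.0.1", port=5349)
```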


chymian commented Sep 4, 2023

@saharNooby thank you for your great work.
Is there any chance of getting @abetlen's server/API implemented?
IMHO that would be a great gain: an easy OpenAI-compatible API without the burden of the LocalAI build process.
