llama : add batched inference endpoint to server #3478
Comments
Since there are many efficient quantization levels in llama.cpp, adding batch inference and continuous batching to the server will make it highly competitive with other inference frameworks like vllm or hf-tgi.
Yes, vllm and hf-tgi seem to be unavailable on Windows. With Transformers, a batch of 10 sequences takes about 25 seconds; I think it would take only about 15 seconds with llama.cpp, but I have no idea since I have not managed to test it successfully.
I would also be interested in this one.
Thank you @ggerganov for adding this feature. Maybe I am missing something: when I pass an array of prompts in the request, as described in the README, I get a response only for the last element of the array instead of one for each prompt. Here is an example to reproduce:
Server:
Client:
Output:
I am on main (8e672ef). Any idea? Thank you!
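Since the Server, Client, and Output snippets above did not survive, here is a minimal sketch of the kind of client call being described, assuming the server's /completion endpoint on localhost:8080; the prompts, port, and n_predict value are illustrative, not the original reproduction:

// Hypothetical reproduction sketch (TypeScript, Node 18+ for global fetch):
// POST an array of prompts and print whatever the server sends back.
async function reproduce(): Promise<void> {
  const res = await fetch("http://localhost:8080/completion", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      prompt: [
        "Building a website can be done in 10 steps:",
        "Write a short poem about autumn:",
      ],
      n_predict: 64,
    }),
  });
  const data = await res.json();
  // Reported behavior: only one completion comes back, matching the last prompt.
  console.log(JSON.stringify(data, null, 2));
}

reproduce().catch(console.error);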
Did you ever solve this? I'm running into the same issue; I assume it's a formatting error?
Ah, I think I got confused. We solved serving clients in parallel, but not processing prompts in parallel.
Thank you for your response. Makes sense. Last question: I ran some benchmarks early last week using the workaround you described (submitting prompts in separate requests). The benchmarks were done on CPU only with OpenBLAS. I observed the following results running a Mistral model:
Thanks again.
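For reference, here is a minimal sketch of the workaround mentioned above (one request per prompt, fired concurrently). It assumes the server's /completion endpoint on localhost:8080 and that the generated text comes back in the content field; adjust to your build:

// Workaround sketch (TypeScript): submit prompts as separate concurrent requests.
const prompts = ["promptA", "promptB", "promptC"];

async function complete(prompt: string): Promise<string> {
  const res = await fetch("http://localhost:8080/completion", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt, n_predict: 64 }),
  });
  const data = await res.json();
  return data.content as string;
}

// With the server started with several slots (e.g. --parallel 4 in recent
// builds), these requests can be processed together via continuous batching.
Promise.all(prompts.map(complete))
  .then((results) => console.log(results))
  .catch(console.error);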
I think for quantum models, using OpenBLAS should be slower. Compare:

make clean && make -j
./batched-bench mistral-7b-instruct-v0.1.Q4_0.gguf 8192 0 0 0 128,256,512 64,128,256 1,2,3,4

make clean && LLAMA_OPENBLAS=1 make -j
./batched-bench mistral-7b-instruct-v0.1.Q4_0.gguf 8192 0 0 0 128,256,512 64,128,256 1,2,3,4
Here are the results with a
For the use case I was benchmarking, my prompt was much longer than the generated response, so it might be similar to the scenario
You can try the OpenBLAS bench with this PR: #4240, because currently on
So based on the results, indeed the performance does not scale with more batches on these machines (the

git clone https://github.com/ggerganov/whisper.cpp && cd whisper.cpp && make -j bench && ./bench -w 1 -t 8

It can take about a minute or two to run.
Thanks for the explanation @ggerganov and for continuing to look into it; it's super helpful. Here is the output using the same instance:
And here is the output with the OpenBLAS bench using the PR: #4240. The results definitely look better.
If you disable mmap:

diff --git a/examples/batched-bench/batched-bench.cpp b/examples/batched-bench/batched-bench.cpp
index 533c55c..277c901 100644
--- a/examples/batched-bench/batched-bench.cpp
+++ b/examples/batched-bench/batched-bench.cpp
@@ -89,6 +89,7 @@ int main(int argc, char ** argv) {
llama_model_params model_params = llama_model_default_params();
model_params.n_gpu_layers = n_gpu_layers;
+ model_params.use_mmap = false;
llama_model * model = llama_load_model_from_file(params.model.c_str(), model_params);
@@ -104,8 +105,8 @@ int main(int argc, char ** argv) {
ctx_params.n_batch = 512;
ctx_params.mul_mat_q = mmq;
- ctx_params.n_threads = params.n_threads;
- ctx_params.n_threads_batch = params.n_threads_batch == -1 ? params.n_threads : params.n_threads_batch;
+ ctx_params.n_threads = 16;
+ ctx_params.n_threads_batch = 16;
llama_context * ctx = llama_new_context_with_model(model, ctx_params);
Though I would have expected it to scale better with the batch size. Not sure -- maybe I'm still missing something. Btw, I also tried a similar Arm-based instance:
For comparison, here is how it scales on my
Note how the TG time for 1,2,3,4 batches is almost constant - this is what we normally want.
Excellent, thank you @ggerganov, for sharing these findings. I will then focus my efforts on Arm-based instances.
For those not familiar with C, like me, it would be great if a new endpoint were added to server.cpp for batch inference.
For example:
endpoint: /completions
post: {"prompts":["promptA","promptB","promptC"]}
response: {"results":["sequenceA","sequenceB","sequenceC"]}
It is easy to do this with Hugging Face Transformers (as I do right now), but it's quite inefficient. I hope to use llama.cpp to increase the efficiency one day. Since I am not familiar with C, I cannot use baby llama; I can only use JavaScript to exchange data with server.cpp.
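For illustration, here is a sketch of how the proposed endpoint could be called from TypeScript/JavaScript; note that /completions with a prompts array is the feature being requested here, not an API the current server.cpp exposes:

// Sketch of the *proposed* batch endpoint usage; the route and field names
// mirror the example above and belong to the feature request, not an existing API.
async function batchComplete(prompts: string[]): Promise<string[]> {
  const res = await fetch("http://localhost:8080/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompts }),
  });
  const data = await res.json();
  return data.results as string[]; // e.g. ["sequenceA", "sequenceB", "sequenceC"]
}

batchComplete(["promptA", "promptB", "promptC"]).then(console.log);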