Server: when no slot is available, defer the task instead of returning "slot unavailable" #5018

ngxson · 2024-01-18T13:27:09Z

Motivation

Assume the server is running with only one slot. If two requests are sent at the same time, one of them fails with a "slot unavailable" error. This behavior sometimes breaks OpenAI compatibility.

This PR defers the task until one of the slots becomes available.

On the bright side, requests will no longer fail. On the downside, one request now needs to wait for the other one to finish.
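The deferral idea can be illustrated with a minimal, single-threaded sketch. This is not the code from this PR: `Task`, `Slot`, `Scheduler`, and all member names are hypothetical, and the real server would additionally need synchronization between the HTTP thread and the worker thread.

```cpp
#include <cstddef>
#include <deque>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified types; not the actual server structures.
struct Task {
    int id;
    std::string prompt;
};

struct Slot {
    bool busy = false;
    int task_id = -1;
};

class Scheduler {
public:
    explicit Scheduler(std::size_t n_slots) : slots_(n_slots) {}

    // Start the task if a slot is free; otherwise park it instead of failing.
    // Returns true if the task started immediately, false if it was deferred.
    bool submit(Task task) {
        for (Slot & slot : slots_) {
            if (!slot.busy) {
                start(slot, task);
                return true;
            }
        }
        deferred_.push_back(std::move(task));
        return false;
    }

    // Called when a task finishes: free its slot, then hand the slot
    // to the oldest deferred task, if there is one.
    void release(int task_id) {
        for (Slot & slot : slots_) {
            if (slot.busy && slot.task_id == task_id) {
                slot.busy = false;
                slot.task_id = -1;
                if (!deferred_.empty()) {
                    Task next = std::move(deferred_.front());
                    deferred_.pop_front();
                    start(slot, next);
                }
                return;
            }
        }
    }

    bool is_running(int task_id) const {
        for (const Slot & slot : slots_) {
            if (slot.busy && slot.task_id == task_id) {
                return true;
            }
        }
        return false;
    }

    std::size_t n_deferred() const { return deferred_.size(); }

private:
    // Mark the slot as occupied by the given task.
    static void start(Slot & slot, const Task & task) {
        slot.busy = true;
        slot.task_id = task.id;
    }

    std::vector<Slot> slots_;
    std::deque<Task>  deferred_;  // FIFO of tasks waiting for a slot
};
```

With a single slot, a second `submit` returns `false` and the task waits in the queue; calling `release` for the first task then starts the deferred one, which matches the trade-off described above.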

lemmi · 2024-01-18T15:09:57Z

I think having a queue is a good idea, but it probably shouldn't be an unbounded queue.
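One way to bound the queue, sketched under assumptions: `PendingTask`, `BoundedDeferredQueue`, and the `max_pending` limit are hypothetical names and are not part of this PR. When the queue is full, `try_push` reports failure so the caller can reject the request instead of growing the queue without limit.

```cpp
#include <cstddef>
#include <deque>
#include <string>
#include <utility>

// Hypothetical task type used only for this illustration.
struct PendingTask {
    int id;
    std::string prompt;
};

class BoundedDeferredQueue {
public:
    explicit BoundedDeferredQueue(std::size_t max_pending)
        : max_pending_(max_pending) {}

    // Returns false when the queue is full; the caller decides how to
    // reject the request (for example with an error response).
    bool try_push(PendingTask task) {
        if (pending_.size() >= max_pending_) {
            return false;
        }
        pending_.push_back(std::move(task));
        return true;
    }

    // Moves the oldest pending task into `out`; returns false if empty.
    bool try_pop(PendingTask & out) {
        if (pending_.empty()) {
            return false;
        }
        out = std::move(pending_.front());
        pending_.pop_front();
        return true;
    }

    std::size_t size() const { return pending_.size(); }

private:
    std::size_t max_pending_;
    std::deque<PendingTask> pending_;
};
```

The limit trades memory safety against the original failure mode: requests beyond `max_pending` would still be rejected, only later and under heavier load.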

ngxson · 2024-01-18T17:02:41Z

> I think having a queue is a good idea, but it probably shouldn't be an unbounded queue.

I agree with that. In fact, I suspect that the complexity of the server code comes from the communication between the HTTP server thread and the "worker" thread (the one that runs the model).

Nevertheless, having used boost::asio in my company's product, I'm pretty sure that an async-like approach would be the best. Ideally, we could even remove the queue and all the mutexes.

But that means rewriting all the server code from scratch, and for now I really don't have the time to do so.

ggerganov

Good change

Probably you want to std::move(task) to avoid copies
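To illustrate the difference this suggestion makes, here is a small sketch with a hypothetical `TracedTask` type that counts its own copies. The helper names `defer_by_copy` and `defer_by_move` are made up for this example and do not appear in the server code.

```cpp
#include <deque>
#include <string>
#include <utility>

// Hypothetical task type that counts how often it is copy-constructed.
struct TracedTask {
    static int & copies() {
        static int n = 0;
        return n;
    }

    std::string prompt;

    explicit TracedTask(std::string p) : prompt(std::move(p)) {}
    TracedTask(const TracedTask & other) : prompt(other.prompt) { ++copies(); }
    TracedTask(TracedTask && other) noexcept : prompt(std::move(other.prompt)) {}
};

// Copies the task (including its prompt string) into the queue.
inline void defer_by_copy(std::deque<TracedTask> & queue, const TracedTask & task) {
    queue.push_back(task);
}

// Moves the task into the queue; the caller's object is left in a valid
// but unspecified state and should not be used for its contents afterwards.
inline void defer_by_move(std::deque<TracedTask> & queue, TracedTask && task) {
    queue.push_back(std::move(task));
}
```

For a task that carries a long prompt or other heap-allocated data, moving transfers ownership of the buffers instead of duplicating them, which is presumably the copy the review comment wants to avoid.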

Xuan Son Nguyen added 2 commits January 18, 2024 14:19

server: defer task when no slot is available

bf0daf4

remove unnecessary log

558cd1d

ngxson mentioned this pull request Jan 18, 2024

Tasks queue logic doesn't seem to be logical #5000

Closed

4 tasks

ggerganov approved these changes Jan 18, 2024

ggerganov merged commit 821f0a2 into ggerganov:master Jan 18, 2024
39 of 44 checks passed

This was referenced Jan 19, 2024

server : improvements and maintenance #4216

Open

Server: try to refactor server.cpp #5065

Merged

hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024

server : defer tasks when "slot unavailable" (ggerganov#5018)

4fd7ed5

* server: defer task when no slot is available
* remove unnecessary log

Co-authored-by: Xuan Son Nguyen <[email protected]>
