
Server: try to refactor server.cpp #5065

Merged: 8 commits, Jan 26, 2024

Conversation

@ngxson (Collaborator) commented Jan 21, 2024

Motivation

The current server.cpp code is quite complicated to understand. This is my first attempt at refactoring it.

I don't even know if I'm heading in the right direction. I really need some help and would appreciate any suggestions and criticism.

Here's what I've managed to do so far:

  • Move all oai-compat functions to oai.hpp
  • Move utilities and type definitions to utils.hpp
  • Wrap queue_tasks in a llama_server_queue interface, which manages its own mutex (the idea is not to expose any mutex outside of that class; a rough sketch of what I mean is just below this list)
  • Wrap queue_results in llama_server_response_event. My initial idea was to register one callback per HTTP stream, but I didn't succeed because set_chunked_content_provider is still done in a blocking way.
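
For reference, here is a rough sketch (not the exact code in this PR) of what I mean by hiding the mutex inside the queue interface; task_server is reduced to a placeholder struct just to keep the sketch self-contained:

#include <condition_variable>
#include <functional>
#include <mutex>
#include <string>
#include <vector>

// placeholder for the real task_server struct, only for this sketch
struct task_server { int id; std::string data; };

struct llama_server_queue {
    // called from HTTP threads: enqueue a task and wake up the main loop
    void post(task_server task) {
        std::unique_lock<std::mutex> lock(mutex_tasks);
        queue_tasks.push_back(std::move(task));
        condition_tasks.notify_one();
    }

    // called from the main thread: block until a task arrives, then process it
    void start_loop(const std::function<void(task_server &)> & process_task) {
        while (true) {
            task_server task;
            {
                std::unique_lock<std::mutex> lock(mutex_tasks);
                condition_tasks.wait(lock, [&] { return !queue_tasks.empty(); });
                task = std::move(queue_tasks.front());
                queue_tasks.erase(queue_tasks.begin());
            }
            process_task(task); // runs without holding the lock
        }
    }

private:
    std::vector<task_server> queue_tasks;
    std::mutex               mutex_tasks;
    std::condition_variable  condition_tasks;
};

The point is that server.cpp only ever calls post() and start_loop(); the lock never leaks out of the struct.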

My long-term idea is:

  • Continue to break the code into smaller files. Ideally, server.cpp would contain only argument parsing and some high-level calls
  • Rewrite the task system so that we can get rid of llama_server_queue.on_all_tasks_finished
  • Unify multitask and task into one place, where we can do something like "update this multitask when this task is finished"
  • Unify slot and task into one place, where we can do something like "defer the task to run when a slot is available"

typedef std::function<void(int, int, task_result&)> callback_multitask_t;
callback_multitask_t callback_update_multitask;
// for keeping track of all tasks waiting for the result
std::mutex mutex_task_ids;
ggerganov (Owner):

This mutex is not needed.

In general, the server should never have more than 2 mutexes - one for input and one for output. If we add a 3rd one, then something is not right.

ngxson (Collaborator, Author):

You're right, thanks for pointing that out. I'll see what I can do to remove mutex_task_ids.

std::condition_variable condition_tasks;
std::mutex mutex_results;
std::condition_variable condition_results;
std::mutex mutex_multitasks;
ggerganov (Owner):

Refactoring all synchronization and queues into utils.hpp is OK, but the result should be that there are no longer any mutexes or locks in server.cpp

Can we avoid this somehow?

ngxson (Collaborator, Author):

Sure, and I suspect that queue_multitasks may not even need a mutex.

The reason we needed a mutex was that queue_multitasks is modified by both the main thread and the HTTP thread (by calling llama.next_result).

But my new implementation of llama.next_result (moved to llama_server_response_event.recv) does not touch queue_multitasks anymore. Instead, the multitask queue is updated via llama_server_response_event.send, which is always called by the main thread.

Ideally, though, I want multitask & task to live in the same place, i.e. maybe add them to the new struct llama_server_queue.
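
Roughly the shape I have in mind, as a sketch with assumed names and fields (not the exact PR code): send() is the only place that touches multitask bookkeeping, and only the main thread calls it, while recv() on the HTTP side never does:

#include <algorithm>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <vector>

// assumed minimal shape of task_result for this sketch
struct task_result { int id = -1; int multitask_id = -1; bool stop = false; };

struct llama_server_response_event {
    // set by the queue owner; folds a sub-result into its parent multitask
    std::function<void(int, int, task_result &)> callback_update_multitask;

    // main thread only: publish a result; multitask updates happen here
    void send(task_result result) {
        std::unique_lock<std::mutex> lock(mutex_results);
        if (result.multitask_id != -1 && callback_update_multitask) {
            callback_update_multitask(result.multitask_id, result.id, result);
        }
        queue_results.push_back(result);
        condition_results.notify_all();
    }

    // HTTP threads: wait for the result of one task; never touches multitasks
    task_result recv(int task_id) {
        std::unique_lock<std::mutex> lock(mutex_results);
        condition_results.wait(lock, [&] {
            return std::any_of(queue_results.begin(), queue_results.end(),
                [&](const task_result & r) { return r.id == task_id; });
        });
        auto it = std::find_if(queue_results.begin(), queue_results.end(),
            [&](const task_result & r) { return r.id == task_id; });
        task_result res = *it;
        queue_results.erase(it);
        return res;
    }

private:
    std::vector<task_result> queue_results;
    std::mutex               mutex_results;
    std::condition_variable  condition_results;
};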

ngxson (Collaborator, Author):

multitasks is now moved into llama_server_queue in d87b48f and 1bd8678.

I didn't have many ideas for testing it, but it seems to work just fine:


How I tested it: Send POST to http://localhost:8080/completion

Body:

{
    "prompt": ["hi", "bonjour"],
    "n_predict": 10
}

Response (expected one response in English and one in French):

{
    "results": [
        {
            "content": ", 2018)\n\nSangam (2019",
            ...
        },
        {
            "content": " le chôme-Ouest, b",
            ...
        }
    ]
}

Another test using the OAI /v1/chat/completions endpoint with stream: true also works fine.

I didn't have time to test the embedding API, but I believe it should work.

// copy aggregate results of complete multi-tasks to the results queue
std::lock_guard<std::mutex> lock_results(mutex_results);
queue_results.insert(queue_results.end(), agg_results.begin(), agg_results.end());
result.result_json = json{ { "results", result_jsons } };
ngxson (Collaborator, Author):

Side note: I also corrected the response data of the /completion endpoint (the root should be an object result: ..., not an array). I think that because most people use the OAI endpoint instead, no one else had noticed the bug 😄

@ngxson ngxson requested a review from ggerganov January 22, 2024 21:50
}

// Call when the state of one slot is changed
void notify_slot_changed() {
ngxson (Collaborator, Author):

This is an optimized version of #5018. Instead of pushing the deferred tasks back on every iteration of the main loop (brute force), here I only push them back when a slot changes its state.
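
As a simplified sketch (names approximate, extending the llama_server_queue sketch from my earlier comment), the deferral logic is just two extra members:

std::vector<task_server> queue_tasks_deferred; // new member: tasks waiting for a free slot

// deferred tasks wait in their own list instead of being re-posted every iteration
void defer(task_server task) {
    std::unique_lock<std::mutex> lock(mutex_tasks);
    queue_tasks_deferred.push_back(std::move(task));
}

// call when the state of one slot is changed (e.g. a slot becomes free):
// move all deferred tasks back to the main queue; any that still cannot run
// will simply be deferred again by the processing code
void notify_slot_changed() {
    std::unique_lock<std::mutex> lock(mutex_tasks);
    for (auto & task : queue_tasks_deferred) {
        queue_tasks.push_back(std::move(task));
    }
    queue_tasks_deferred.clear();
}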

@RafaAguilar commented Jan 23, 2024

Early Load Test

I used two different tools with different load strategies. With both I obtained over 40 successful requests over 120 seconds, with a throughput of about 0.27 requests per second, using the same request body as in the README for all requests.

Setup

  • 1 slot
  • Model: llama-2-7b.Q8_0
  • Target rate: 1 request per second
  • Up to 5 requests in parallel
  • Duration: 120 seconds
  • M1 Max

Next Steps

  • Add a warm-up period for both tools
  • Use a lighter model (as suggested by @ngxson)
  • Slightly randomize prompts
  • Try to establish a baseline
  • Automate install and run (.sh)

Raw Results

Server setup:

./server -m ../llms/models/llama2/llama-2-7b.Q8_0.gguf -c 2048 -t 8 -ngl 99
{"timestamp":1706006199,"level":"INFO","function":"main","line":2295,"message":"build info","build":1960,"commit":"d083c81"}
{"timestamp":1706006199,"level":"INFO","function":"main","line":2302,"message":"system info","n_threads":8,"n_threads_batch":-1,"total_threads":10,"system_info":"AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | "}

llama server listening at http://127.0.0.1:8080

{"timestamp":1706006199,"level":"INFO","function":"main","line":2401,"message":"HTTP server listening","port":"8080","hostname":"127.0.0.1"}
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from ../llms/models/llama2/llama-2-7b.Q8_0.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 7
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  18:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q8_0:  226 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 6.74 B
llm_load_print_meta: model size       = 6.67 GiB (8.50 BPW)
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.22 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size =  6695.84 MiB, ( 6695.91 / 49152.00)
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      Metal buffer size =  6695.83 MiB
llm_load_tensors:        CPU buffer size =   132.81 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Max
ggml_metal_init: picking default device: Apple M1 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/rafa/repos/llama.cpp/ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M1 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 51539.61 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =  1024.00 MiB, ( 7721.47 / 49152.00)
llama_kv_cache_init:      Metal KV buffer size =  1024.00 MiB
llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =     0.02 MiB, ( 7721.48 / 49152.00)
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   172.02 MiB, ( 7893.48 / 49152.00)
llama_new_context_with_model: graph splits (measure): 3
llama_new_context_with_model:      Metal compute buffer size =   172.00 MiB
llama_new_context_with_model:        CPU compute buffer size =    12.00 MiB
Available slots:
 -> Slot 0 - max context: 2048
{"timestamp":1706006199,"level":"INFO","function":"main","line":2422,"message":"model loaded"}
all slots are idle and system prompt is empty, clear the KV cache
slot 0 is processing [task id: 0]
slot 0 : kv cache rm - [0, end)

Using K6:

scenarios: (100.00%) 1 scenario, 5 max VUs, 2m45s max duration (incl. graceful stop):
           * default: Up to 5 looping VUs for 2m15s over 2 stages (gracefulRampDown: 30s, gracefulStop: 30s)


     ✓ Status is 200
     ✓ Content-Type is application/json

     checks.........................: 100.00% ✓ 84       ✗ 0
     data_received..................: 79 kB   528 B/s
     data_sent......................: 9.3 kB  62 B/s
     http_req_blocked...............: avg=104.42µs min=1µs   med=2µs    max=1.1ms  p(90)=412.7µs  p(95)=432.65µs
     http_req_connecting............: avg=75.95µs  min=0s    med=0s     max=411µs  p(90)=293.39µs p(95)=388.95µs
     http_req_duration..............: avg=16.09s   min=3.51s med=17.79s max=18.23s p(90)=18.05s   p(95)=18.19s
       { expected_response:true }...: avg=16.09s   min=3.51s med=17.79s max=18.23s p(90)=18.05s   p(95)=18.19s
     http_req_failed................: 0.00%   ✓ 0        ✗ 42
     http_req_receiving.............: avg=43.59µs  min=29µs  med=40µs   max=101µs  p(90)=54µs     p(95)=65.59µs
     http_req_sending...............: avg=18.42µs  min=10µs  med=13.5µs max=48µs   p(90)=37.6µs   p(95)=40.95µs
     http_req_tls_handshaking.......: avg=0s       min=0s    med=0s     max=0s     p(90)=0s       p(95)=0s
     http_req_waiting...............: avg=16.09s   min=3.51s med=17.79s max=18.23s p(90)=18.05s   p(95)=18.19s
     http_reqs......................: 42      0.280342/s
     iteration_duration.............: avg=16.09s   min=3.51s med=17.79s max=18.23s p(90)=18.05s   p(95)=18.19s
     iterations.....................: 42      0.280342/s
     vus............................: 1       min=1      max=5
     vus_max........................: 5       min=5      max=5


running (2m29.8s), 0/5 VUs, 42 complete and 0 interrupted iterations

Using Vegeta:

Requests      [total, rate, throughput]         47, 0.39, 0.27
Duration      [total, attack, wait]             2m13s, 1m59s, 14.036s
Latencies     [min, mean, 50, 90, 95, 99, max]  142.042µs, 10.176s, 14.011s, 14.037s, 14.038s, 14.04s, 14.04s
Bytes In      [total, mean]                     61420, 1306.81
Bytes Out     [total, mean]                     2952, 62.81
Success       [ratio]                           76.60%
Status Codes  [code:count]                      0:11  200:36
Error Set:
Post "http://localhost:8080/completion": dial tcp 0.0.0.0:0->[::1]:8080: connect: connection refused

# stats
Bucket           #   %       Histogram
[0s,     25ms]   11  23.40%  ################# #<-starting up errors
[25ms,   50ms]   0   0.00%
[50ms,   250ms]  0   0.00%
[250ms,  2.5s]   0   0.00%
[2.5s,   5s]     1   2.13%   #
[5s,     7.5s]   1   2.13%   #
[7.5s,   10s]    1   2.13%   #
[10s,    12.5s]  1   2.13%   #
[12.5s,  14s]    6   12.77%  #########
[14s,    +Inf]   26  55.32%  #########################################

@ggerganov (Owner) left a comment:

One general comment about the server: I have always thought that we lack extensive logs of what is happening during processing. We should probably add significantly more log messages so that it is easier to debug things and to better understand the current server state. Over time, once the implementation stabilizes, we can reduce the logs or add verbosity levels.

Just something to keep in mind for future PRs - no need to extend this one for now

Comment on lines 2595 to 2597
const int task_id = llama.request_completion(data, false, false, -1);
llama.queue_results.add_waiting_task_id(task_id);

ggerganov (Owner):

I wonder, do we need to explicitly add waiting task ids like this - can it become automatic within request_completion or more specifically inside the post() method?

Main thing is that I'm worried about this pattern:

    void add_waiting_task_id(int task_id) {
        std::unique_lock<std::mutex> lock(mutex_results);
        waiting_task_ids.insert(task_id);
    }

This makes the waiting_task_ids container atomic, but it hides a potential race condition where we haven't added the task id yet, but the receiver is already waiting for it.

Similar thoughts about remove_waiting_task_id - I think it would be better to absorb it within recv(). What do you think?

ngxson (Collaborator, Author):

During my testing, I ran into an issue where the user cancels the task (disconnects the HTTP stream) while the tasks are still being processed. The results are still being sent even though the socket is already closed (because the "cancel" task has not yet been processed). The main idea of waiting_task_ids is to keep track of whether the HTTP stream is still alive.

I did try moving both add_waiting_task_id and remove_waiting_task_id into recv(), but the problem in streaming mode is: what if the result arrives while the HTTP server is busy sending data and the next recv() has not been called yet? In that situation, I saw some results being dropped.

Regarding the race condition you pointed out, I don't think it will happen. Basically:

  • We always call add_waiting_task_id before recv. This is always sequential
  • remove_waiting_task_id only gets called right before the HTTP stream is closed, so no result is dropped as long as the HTTP stream is alive
  • My idea is inspired by the socket pattern of accept, then a recv loop, and finally close

Also, it's worth mentioning that one HTTP request is linked to one or more tasks, but one task is never shared among HTTP requests. In the worst case, a result that is not picked up by any HTTP request would remain in the result queue forever. Therefore, I think it's still worth keeping track of the state of the HTTP connection so that we don't accumulate such abandoned results.

But it's true that I may have messed up somewhere, since I saw some strange behavior. I think I'll revisit this part in another PR. What do you think about that, @ggerganov?
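
To make the accept/recv/close analogy concrete, the per-request lifecycle I have in mind looks roughly like this (a sketch with approximate names, streaming case):

// "accept": mark the HTTP stream as alive so its results are kept
llama.queue_results.add_waiting_task_id(task_id);

// "recv" loop: pull results for this task and stream them back
while (true) {
    task_result result = llama.queue_results.recv(task_id);
    // ... write result to the HTTP response ...
    if (result.stop) {
        break;
    }
}

// "close": the stream is done, late results for this id are dropped
llama.queue_results.remove_waiting_task_id(task_id);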

ggerganov (Owner):

I could be misreading something, but I think there is definitely a race:

thread 0   -> llama.request_completion()
thread   1 -> llama_server_response.send()       <-- does nothing because no waiting tasks
thread 0   -> llama.queue_results.add_waiting_task_id()
thread 0   -> llama.queue_results.recv(task_id); <-- result is missed

ngxson (Collaborator, Author):

Ah yeah, you're right, sorry I missed that point. Thanks for noticing.

I noticed the same potential issue with split_multiprompt_task, where I add all subtasks before adding the multitask; it will definitely be fixed.

My idea would be to generate the task ID using queue_tasks.get_next_id(), then pass this id into request_completion so that we know the task_id before adding it to the queue.

Another idea is to add a parameter bool start = true to queue_tasks.post(). When calling queue_tasks.post(task, false), the task would not be started right away; only once we call queue_tasks.start_task(task_id_or_multitask_id) would the task (or tasks) start. But this is quite a hack and also overkill, I think.

What do you think about that?

@ngxson (Collaborator, Author) commented Jan 23, 2024:

Should be resolved in 8f36df8.

request_completion now requires passing an ID for the new task. When a new request arrives, things are done in the order below (high-level view; a rough sketch follows the list):

  1. llama.queue_tasks.get_new_id()
  2. llama.queue_results.add_waiting_task_id(task_id)
  3. Add the task_multi if needed
  4. Finally, add all the task_server
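
As a sketch, the handler side now looks roughly like this (signatures approximate, not the exact merged code):

// the id is reserved and registered as "waiting" before any task is posted,
// so a result can no longer be sent before its receiver is registered
const int task_id = llama.queue_tasks.get_new_id();           // step 1
llama.queue_results.add_waiting_task_id(task_id);             // step 2
llama.request_completion(task_id, data, false, false, -1);    // steps 3 and 4
task_result result = llama.queue_results.recv(task_id);       // safe: nothing can be missed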

@ngxson (Collaborator, Author) commented Jan 23, 2024

@RafaAguilar Nice results, thanks!

I wonder if it would be better for you to push your code in a separate PR. My idea is:

  • You make a PR to add the test script (based on current master, to establish a baseline)
  • @ggerganov merges it into master
  • I update my branch with master, run the test script, and compare the results to the baseline above

@ngxson ngxson requested a review from ggerganov January 24, 2024 13:29
@ggerganov ggerganov added the "need feedback" label (Testing and feedback with results are needed) Jan 25, 2024
@ggerganov ggerganov merged commit 48c857a into ggerganov:master Jan 26, 2024
45 checks passed
@ggerganov (Owner) commented:

I think this is a good start

Regarding "Continue to break the code into smaller files": this is probably not really necessary at this point - the code is relatively small.

jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Feb 3, 2024
* server: add llama_server_queue struct

* server: add llama_server_response_event

* server: add comments

* server: move all mutexes away from server.cpp

* server: correct multitask response

* server: only add back deferred tasks when one slot is available

* server: fix a race condition cause by "request_completion"
@ngxson ngxson mentioned this pull request Feb 28, 2024
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
* server: add llama_server_queue struct

* server: add llama_server_response_event

* server: add comments

* server: move all mutexes away from server.cpp

* server: correct multitask response

* server: only add back deferred tasks when one slot is available

* server: fix a race condition cause by "request_completion"
Labels: need feedback (Testing and feedback with results are needed)
Projects: None yet
3 participants