
Server: try to refactor server.cpp #5065

Merged: 8 commits, Jan 26, 2024

Conversation

@ngxson (Collaborator) commented Jan 21, 2024

Motivation

The current server.cpp code is quite complicated to understand. This is my first attempt at refactoring it.

I don't even know if I'm heading in the right direction. I really need some help and would appreciate any suggestions and criticism.

Here's what I've managed to do so far:

  • Move all oai-compat functions to oai.hpp
  • Move utilities and type definitions to utils.hpp
  • Wrap queue_tasks in a llama_server_queue interface, which manages its own mutex (the idea is not to expose any mutex outside of that class; a rough sketch of what I mean is just below this list)
  • Wrap queue_results in llama_server_response_event. My initial idea was to register one callback per HTTP stream, but I didn't succeed because set_chunked_content_provider is still done in a blocking way.
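
For reference, here is a rough sketch (not the exact code in this PR) of what I mean by hiding the mutex inside the queue interface; task_server is reduced to a placeholder struct just to keep the sketch self-contained:

#include <condition_variable>
#include <functional>
#include <mutex>
#include <string>
#include <vector>

// placeholder for the real task_server struct, only for this sketch
struct task_server { int id; std::string data; };

struct llama_server_queue {
    // called from HTTP threads: enqueue a task and wake up the main loop
    void post(task_server task) {
        std::unique_lock<std::mutex> lock(mutex_tasks);
        queue_tasks.push_back(std::move(task));
        condition_tasks.notify_one();
    }

    // called from the main thread: block until a task arrives, then process it
    void start_loop(const std::function<void(task_server &)> & process_task) {
        while (true) {
            task_server task;
            {
                std::unique_lock<std::mutex> lock(mutex_tasks);
                condition_tasks.wait(lock, [&] { return !queue_tasks.empty(); });
                task = std::move(queue_tasks.front());
                queue_tasks.erase(queue_tasks.begin());
            }
            process_task(task); // runs without holding the lock
        }
    }

private:
    std::vector<task_server> queue_tasks;
    std::mutex               mutex_tasks;
    std::condition_variable  condition_tasks;
};

The point is that server.cpp only ever calls post() and start_loop(); the lock never leaks out of the struct.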

My long-term idea is:

  • Continue to break the code into smaller files. Ideally, server.cpp would contain only argument parsing and some high-level calls
  • Rewrite the task system so that we can get rid of llama_server_queue.on_all_tasks_finished
  • Unify multitask and task into one place, where we can do something like "update this multitask when this task is finished"
  • Unify slot and task into one place, where we can do something like "defer the task to run when a slot is available"

typedef std::function<void(int, int, task_result&)> callback_multitask_t;
callback_multitask_t callback_update_multitask;
// for keeping track of all tasks waiting for the result
std::mutex mutex_task_ids;
ggerganov (Owner):

This mutex is not needed.

In general, the server should never have more than 2 mutexes - one for input and one for output. If we add a 3rd one, then something is not right.

ngxson (Collaborator, Author):

You're right, thanks for pointing that out. I'll see what I can do to remove mutex_task_ids.

std::condition_variable condition_tasks;
std::mutex mutex_results;
std::condition_variable condition_results;
std::mutex mutex_multitasks;
ggerganov (Owner):

Refactoring all synchronization and queues into utils.hpp is OK, but the result should be that there are no longer any mutexes or locks in server.cpp

Can we avoid this somehow?

ngxson (Collaborator, Author):

Sure, and I suspect that queue_multitasks may not even need a mutex.

The reason we needed a mutex was that queue_multitasks is modified by both the main thread and the HTTP thread (by calling llama.next_result).

But my new implementation of llama.next_result (moved to llama_server_response_event.recv) does not touch queue_multitasks anymore. Instead, the multitask queue is updated via llama_server_response_event.send, which is always called by the main thread.

Ideally, though, I want multitask & task to live in the same place, i.e. maybe add them to the new struct llama_server_queue.
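
Roughly the shape I have in mind, as a sketch with assumed names and fields (not the exact PR code): send() is the only place that touches multitask bookkeeping, and only the main thread calls it, while recv() on the HTTP side never does:

#include <algorithm>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <vector>

// assumed minimal shape of task_result for this sketch
struct task_result { int id = -1; int multitask_id = -1; bool stop = false; };

struct llama_server_response_event {
    // set by the queue owner; folds a sub-result into its parent multitask
    std::function<void(int, int, task_result &)> callback_update_multitask;

    // main thread only: publish a result; multitask updates happen here
    void send(task_result result) {
        std::unique_lock<std::mutex> lock(mutex_results);
        if (result.multitask_id != -1 && callback_update_multitask) {
            callback_update_multitask(result.multitask_id, result.id, result);
        }
        queue_results.push_back(result);
        condition_results.notify_all();
    }

    // HTTP threads: wait for the result of one task; never touches multitasks
    task_result recv(int task_id) {
        std::unique_lock<std::mutex> lock(mutex_results);
        condition_results.wait(lock, [&] {
            return std::any_of(queue_results.begin(), queue_results.end(),
                [&](const task_result & r) { return r.id == task_id; });
        });
        auto it = std::find_if(queue_results.begin(), queue_results.end(),
            [&](const task_result & r) { return r.id == task_id; });
        task_result res = *it;
        queue_results.erase(it);
        return res;
    }

private:
    std::vector<task_result> queue_results;
    std::mutex               mutex_results;
    std::condition_variable  condition_results;
};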

ngxson (Collaborator, Author):

multitasks is now moved into llama_server_queue in d87b48f and 1bd8678.

I didn't have many ideas for testing it, but it seems to work just fine:


How I tested it: Send POST to http://localhost:8080/completion

Body:

{
    "prompt": ["hi", "bonjour"],
    "n_predict": 10
}

Response (expected one response in English and one in French):

{
    "results": [
        {
            "content": ", 2018)\n\nSangam (2019",
            ...
        },
        {
            "content": " le chôme-Ouest, b",
            ...
        }
    ]
}

Another test using the OAI /v1/chat/completions endpoint with stream: true also works fine.

I didn't have time to test the embedding API, but I believe it should work.

// copy aggregate results of complete multi-tasks to the results queue
std::lock_guard<std::mutex> lock_results(mutex_results);
queue_results.insert(queue_results.end(), agg_results.begin(), agg_results.end());
result.result_json = json{ { "results", result_jsons } };
ngxson (Collaborator, Author):

Side note: I also corrected the response data of the /completion endpoint (the root should be an object result: ..., not an array). I think that because most people use the OAI endpoint instead, no one else had noticed the bug 😄

@ngxson ngxson requested a review from ggerganov January 22, 2024 21:50
}

// Call when the state of one slot is changed
void notify_slot_changed() {
ngxson (Collaborator, Author):

This is an optimized version of #5018. Instead of pushing the deferred tasks back on every iteration of the main loop (brute force), here I only push them back when a slot changes its state.
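
As a simplified sketch (names approximate, extending the llama_server_queue sketch from my earlier comment), the deferral logic is just two extra members:

std::vector<task_server> queue_tasks_deferred; // new member: tasks waiting for a free slot

// deferred tasks wait in their own list instead of being re-posted every iteration
void defer(task_server task) {
    std::unique_lock<std::mutex> lock(mutex_tasks);
    queue_tasks_deferred.push_back(std::move(task));
}

// call when the state of one slot is changed (e.g. a slot becomes free):
// move all deferred tasks back to the main queue; any that still cannot run
// will simply be deferred again by the processing code
void notify_slot_changed() {
    std::unique_lock<std::mutex> lock(mutex_tasks);
    for (auto & task : queue_tasks_deferred) {
        queue_tasks.push_back(std::move(task));
    }
    queue_tasks_deferred.clear();
}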

@RafaAguilar commented Jan 23, 2024

Early Load Test

I used two different tools with different load strategies. With both I obtained over 40 successful requests over 120 seconds, with a throughput of about 0.27 requests per second, using the same request body as in the README for all requests.

Setup

  • 1 slot
  • Model: llama-2-7b.Q8_0
  • Target rate: 1 request per second
  • Up to 5 requests in parallel
  • Duration: 120 seconds
  • M1 Max

Next Steps

  • Add a warm-up period for both tools
  • Use a lighter model (as suggested by @ngxson)
  • Slightly randomize prompts
  • Try to establish a baseline
  • Automate install and run (.sh)

Raw Results

Server setup:

./server -m ../llms/models/llama2/llama-2-7b.Q8_0.gguf -c 2048 -t 8 -ngl 99
{"timestamp":1706006199,"level":"INFO","function":"main","line":2295,"message":"build info","build":1960,"commit":"d083c81"}
{"timestamp":1706006199,"level":"INFO","function":"main","line":2302,"message":"system info","n_threads":8,"n_threads_batch":-1,"total_threads":10,"system_info":"AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | "}

llama server listening at http://127.0.0.1:8080

{"timestamp":1706006199,"level":"INFO","function":"main","line":2401,"message":"HTTP server listening","port":"8080","hostname":"127.0.0.1"}
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from ../llms/models/llama2/llama-2-7b.Q8_0.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 7
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  18:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q8_0:  226 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 6.74 B
llm_load_print_meta: model size       = 6.67 GiB (8.50 BPW)
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.22 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size =  6695.84 MiB, ( 6695.91 / 49152.00)
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      Metal buffer size =  6695.83 MiB
llm_load_tensors:        CPU buffer size =   132.81 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Max
ggml_metal_init: picking default device: Apple M1 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/rafa/repos/llama.cpp/ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M1 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 51539.61 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =  1024.00 MiB, ( 7721.47 / 49152.00)
llama_kv_cache_init:      Metal KV buffer size =  1024.00 MiB
llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =     0.02 MiB, ( 7721.48 / 49152.00)
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   172.02 MiB, ( 7893.48 / 49152.00)
llama_new_context_with_model: graph splits (measure): 3
llama_new_context_with_model:      Metal compute buffer size =   172.00 MiB
llama_new_context_with_model:        CPU compute buffer size =    12.00 MiB
Available slots:
 -> Slot 0 - max context: 2048
{"timestamp":1706006199,"level":"INFO","function":"main","line":2422,"message":"model loaded"}
all slots are idle and system prompt is empty, clear the KV cache
slot 0 is processing [task id: 0]
slot 0 : kv cache rm - [0, end)

Using K6:

scenarios: (100.00%) 1 scenario, 5 max VUs, 2m45s max duration (incl. graceful stop):
           * default: Up to 5 looping VUs for 2m15s over 2 stages (gracefulRampDown: 30s, gracefulStop: 30s)


     ✓ Status is 200
     ✓ Content-Type is application/json

     checks.........................: 100.00% ✓ 84       ✗ 0
     data_received..................: 79 kB   528 B/s
     data_sent......................: 9.3 kB  62 B/s
     http_req_blocked...............: avg=104.42µs min=1µs   med=2µs    max=1.1ms  p(90)=412.7µs  p(95)=432.65µs
     http_req_connecting............: avg=75.95µs  min=0s    med=0s     max=411µs  p(90)=293.39µs p(95)=388.95µs
     http_req_duration..............: avg=16.09s   min=3.51s med=17.79s max=18.23s p(90)=18.05s   p(95)=18.19s
       { expected_response:true }...: avg=16.09s   min=3.51s med=17.79s max=18.23s p(90)=18.05s   p(95)=18.19s
     http_req_failed................: 0.00%   ✓ 0        ✗ 42
     http_req_receiving.............: avg=43.59µs  min=29µs  med=40µs   max=101µs  p(90)=54µs     p(95)=65.59µs
     http_req_sending...............: avg=18.42µs  min=10µs  med=13.5µs max=48µs   p(90)=37.6µs   p(95)=40.95µs
     http_req_tls_handshaking.......: avg=0s       min=0s    med=0s     max=0s     p(90)=0s       p(95)=0s
     http_req_waiting...............: avg=16.09s   min=3.51s med=17.79s max=18.23s p(90)=18.05s   p(95)=18.19s
     http_reqs......................: 42      0.280342/s
     iteration_duration.............: avg=16.09s   min=3.51s med=17.79s max=18.23s p(90)=18.05s   p(95)=18.19s
     iterations.....................: 42      0.280342/s
     vus............................: 1       min=1      max=5
     vus_max........................: 5       min=5      max=5


running (2m29.8s), 0/5 VUs, 42 complete and 0 interrupted iterations

Using Vegeta:

Requests      [total, rate, throughput]         47, 0.39, 0.27
Duration      [total, attack, wait]             2m13s, 1m59s, 14.036s
Latencies     [min, mean, 50, 90, 95, 99, max]  142.042µs, 10.176s, 14.011s, 14.037s, 14.038s, 14.04s, 14.04s
Bytes In      [total, mean]                     61420, 1306.81
Bytes Out     [total, mean]                     2952, 62.81
Success       [ratio]                           76.60%
Status Codes  [code:count]                      0:11  200:36
Error Set:
Post "http://localhost:8080/completion": dial tcp 0.0.0.0:0->[::1]:8080: connect: connection refused

# stats
Bucket           #   %       Histogram
[0s,     25ms]   11  23.40%  ################# #<-starting up errors
[25ms,   50ms]   0   0.00%
[50ms,   250ms]  0   0.00%
[250ms,  2.5s]   0   0.00%
[2.5s,   5s]     1   2.13%   #
[5s,     7.5s]   1   2.13%   #
[7.5s,   10s]    1   2.13%   #
[10s,    12.5s]  1   2.13%   #
[12.5s,  14s]    6   12.77%  #########
[14s,    +Inf]   26  55.32%  #########################################

@ggerganov (Owner) left a comment:

One general comment about the server: I have always thought that we lack extensive logs of what is happening during processing. We should probably add significantly more log messages so that it is easier to debug things and to better understand the current server state. Over time, once the implementation stabilizes, we can reduce the logs or add verbosity levels.

Just something to keep in mind for future PRs - no need to extend this one for now

Comment on lines 2595 to 2597
const int task_id = llama.request_completion(data, false, false, -1);
llama.queue_results.add_waiting_task_id(task_id);

ggerganov (Owner):

I wonder, do we need to explicitly add waiting task ids like this - can it become automatic within request_completion or more specifically inside the post() method?

Main thing is that I'm worried about this pattern:

    void add_waiting_task_id(int task_id) {
        std::unique_lock<std::mutex> lock(mutex_results);
        waiting_task_ids.insert(task_id);
    }

This makes the waiting_task_ids container atomic, but it hides a potential race condition where we haven't added the task id yet, but the receiver is already waiting for it.

Similar thoughts about remove_waiting_task_id - I think it would be better to absorb it within recv(). What do you think?

ngxson (Collaborator, Author):

During my testing, I ran into an issue where the user cancels the task (disconnects the HTTP stream) while the tasks are still being processed. The results are still being sent even though the socket is already closed (because the "cancel" task has not yet been processed). The main idea of waiting_task_ids is to keep track of whether the HTTP stream is still alive.

I did try moving both add_waiting_task_id and remove_waiting_task_id into recv(), but the problem in streaming mode is: what if the result arrives while the HTTP server is busy sending data and the next recv() has not been called yet? In that situation, I saw some results being dropped.

Regarding the race condition you pointed out, I don't think it will happen. Basically:

  • We always call add_waiting_task_id before recv. This is always sequential
  • remove_waiting_task_id only gets called right before the HTTP stream is closed, so no result is dropped as long as the HTTP stream is alive
  • My idea is inspired by the socket pattern of accept, then a recv loop, and finally close

Also, it's worth mentioning that one HTTP request is linked to one or more tasks, but one task is never shared among HTTP requests. In the worst case, a result that is not picked up by any HTTP request would remain in the result queue forever. Therefore, I think it's still worth keeping track of the state of the HTTP connection so that we don't accumulate such abandoned results.

But it's true that I may have messed up somewhere, since I saw some strange behavior. I think I'll revisit this part in another PR. What do you think about that, @ggerganov?
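
To make the accept/recv/close analogy concrete, the per-request lifecycle I have in mind looks roughly like this (a sketch with approximate names, streaming case):

// "accept": mark the HTTP stream as alive so its results are kept
llama.queue_results.add_waiting_task_id(task_id);

// "recv" loop: pull results for this task and stream them back
while (true) {
    task_result result = llama.queue_results.recv(task_id);
    // ... write result to the HTTP response ...
    if (result.stop) {
        break;
    }
}

// "close": the stream is done, late results for this id are dropped
llama.queue_results.remove_waiting_task_id(task_id);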

ggerganov (Owner):

I could be misreading something, but I think there is definitely a race:

thread 0   -> llama.request_completion()
thread   1 -> llama_server_response.send()       <-- does nothing because no waiting tasks
thread 0   -> llama.queue_results.add_waiting_task_id()
thread 0   -> llama.queue_results.recv(task_id); <-- result is missed

ngxson (Collaborator, Author):

Ah yeah, you're right, sorry I missed that point. Thanks for noticing.

I noticed the same potential issue with split_multiprompt_task, where I add all subtasks before adding the multitask; it will definitely be fixed.

My idea would be to generate the task ID using queue_tasks.get_next_id(), then pass this id into request_completion so that we know the task_id before adding it to the queue.

Another idea is to add a parameter bool start = true to queue_tasks.post(). When calling queue_tasks.post(task, false), the task would not be started right away; only once we call queue_tasks.start_task(task_id_or_multitask_id) would the task (or tasks) start. But this is quite a hack and also overkill, I think.

What do you think about that?

@ngxson (Collaborator, Author) commented Jan 23, 2024:

Should be resolved in 8f36df8.

request_completion now requires passing an ID for the new task. When a new request arrives, things are done in the order below (high-level view; a rough sketch follows the list):

  1. llama.queue_tasks.get_new_id()
  2. llama.queue_results.add_waiting_task_id(task_id)
  3. Add the task_multi if needed
  4. Finally, add all the task_server
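
As a sketch, the handler side now looks roughly like this (signatures approximate, not the exact merged code):

// the id is reserved and registered as "waiting" before any task is posted,
// so a result can no longer be sent before its receiver is registered
const int task_id = llama.queue_tasks.get_new_id();           // step 1
llama.queue_results.add_waiting_task_id(task_id);             // step 2
llama.request_completion(task_id, data, false, false, -1);    // steps 3 and 4
task_result result = llama.queue_results.recv(task_id);       // safe: nothing can be missed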

@ngxson (Collaborator, Author) commented Jan 23, 2024

@RafaAguilar Nice results, thanks!

I wonder if it would be better for you to push your code in a separate PR. My idea is:

  • You make a PR to add the test script (based on current master, to establish a baseline)
  • @ggerganov merges it into master
  • I update my branch with master, run the test script, and compare the results to the baseline above

@ngxson ngxson requested a review from ggerganov January 24, 2024 13:29
@ggerganov ggerganov added the "need feedback" label (Testing and feedback with results are needed) Jan 25, 2024
@ggerganov ggerganov merged commit 48c857a into ggerganov:master Jan 26, 2024
45 checks passed
@ggerganov (Owner) commented:

I think this is a good start

Regarding "Continue to break the code into smaller files": this is probably not really necessary at this point - the code is relatively small.

jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Feb 3, 2024
* server: add llama_server_queue struct

* server: add llama_server_response_event

* server: add comments

* server: move all mutexes away from server.cpp

* server: correct multitask response

* server: only add back deferred tasks when one slot is available

* server: fix a race condition cause by "request_completion"
@ngxson ngxson mentioned this pull request Feb 28, 2024
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
* server: add llama_server_queue struct

* server: add llama_server_response_event

* server: add comments

* server: move all mutexes away from server.cpp

* server: correct multitask response

* server: only add back deferred tasks when one slot is available

* server: fix a race condition cause by "request_completion"
Labels: need feedback (Testing and feedback with results are needed)
Projects: None yet
3 participants