problems unloading the model #28

Closed
tommilatti opened this issue Dec 16, 2024 · 3 comments
Assignees: mostlygeek
Labels: bug (Something isn't working)

Comments

@tommilatti

Hi,

I noticed that after the latest changes there seems to be an issue unloading models when the TTL is reached. I set a short TTL of 10 seconds, tested it, and this is what I see:

request: POST /v1/chat/completions 127.0.0.1 200
!!! Unloading model qwen-coder-32b-q4-draft, TTL of 10 reached.
!!! process for qwen-coder-32b-q4-draft stopped with error > signal: terminated
!!! Unloading model qwen-coder-32b-q4-draft, TTL of 10 reached.
!!! Unloading model qwen-coder-32b-q4-draft, TTL of 10 reached.
!!! Unloading model qwen-coder-32b-q4-draft, TTL of 10 reached.
!!! Unloading model qwen-coder-32b-q4-draft, TTL of 10 reached.
!!! Unloading model qwen-coder-32b-q4-draft, TTL of 10 reached.
!!! Unloading model qwen-coder-32b-q4-draft, TTL of 10 reached.
!!! Unloading model qwen-coder-32b-q4-draft, TTL of 10 reached.
!!! Unloading model qwen-coder-32b-q4-draft, TTL of 10 reached.
!!! Unloading model qwen-coder-32b-q4-draft, TTL of 10 reached.
!!! Unloading model qwen-coder-32b-q4-draft, TTL of 10 reached.
!!! Unloading model qwen-coder-32b-q4-draft, TTL of 10 reached.

The above is from the process stdout. If I monitor the logs remotely with the browser, the line "!!! process for qwen-coder-32b-q4-draft stopped with error > signal: terminated" does not appear:

request: POST /v1/chat/completions 127.0.0.1 200
!!! Unloading model qwen-coder-32b-q4-draft, TTL of 10 reached.
!!! Unloading model qwen-coder-32b-q4-draft, TTL of 10 reached.
!!! Unloading model qwen-coder-32b-q4-draft, TTL of 10 reached.
!!! Unloading model qwen-coder-32b-q4-draft, TTL of 10 reached.
!!! Unloading model qwen-coder-32b-q4-draft, TTL of 10 reached.
!!! Unloading model qwen-coder-32b-q4-draft, TTL of 10 reached.
!!! Unloading model qwen-coder-32b-q4-draft, TTL of 10 reached.
!!! Unloading model qwen-coder-32b-q4-draft, TTL of 10 reached.
!!! Unloading model qwen-coder-32b-q4-draft, TTL of 10 reached.
!!! Unloading model qwen-coder-32b-q4-draft, TTL of 10 reached.
!!! Unloading model qwen-coder-32b-q4-draft, TTL of 10 reached.
!!! Unloading model qwen-coder-32b-q4-draft, TTL of 10 reached.

The model was actually unloaded this time, but the above line keeps repeating at roughly one line per second indefinitely. I have also seen a case where the model was not unloaded and the line repeated like before, but I don't have any more info on that at the moment.

The version I'm using is:
version: local_d6ca535 (d6ca535), built at 2024-12-16T12:45:54Z

There also seem to be some differences between the stdout output and the remote log monitoring. I will attach web.txt and stdout.txt so you can diff them if you want; the above can be seen in them.

stdout.txt
web.txt

This is my config:

healthCheckTimeout: 60
models:
  "qwen-coder-32b-q4":
    cmd: >
      /home/user/llama.cpp/build/bin/llama-server
      --port 9503
      -ngl 99
      --flash-attn --cache-type-k q8_0 --cache-type-v q8_0
      --samplers "temperature;top_k;top_p"
      --temp 0.1
      --model /home/user/laama/models/qwen2.5-coder/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf
      --ctx-size 32000
    checkEndpoint: /health
    ttl: 180
    proxy: "http://127.0.0.1:9503"
  "qwen-coder-32b-q4-draft":
    cmd: >
      /home/user/llama.cpp/build/bin/llama-server
      --port 9503
      --flash-attn --cache-type-k q8_0 --cache-type-v q8_0
      --samplers "temperature;top_k;top_p"
      --temp 0.1
      --model /home/user/laama/models/qwen2.5-coder/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf
      -ngl 99
      --ctx-size 27000
      --model-draft /home/user/laama/models/qwen2.5-coder/Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf
      -ngld 99
      --draft-max 16
      --draft-min 4
      --draft-p-min 0.4 
    checkEndpoint: /health
    ttl: 10
    proxy: "http://127.0.0.1:9503"

mostlygeek added a commit that referenced this issue Dec 16, 2024
- fix issue where the goroutine will continue even though the child
  process is no longer running and the Process' state is not Ready
- fix issue where some logs were going to stdout instead of p.logMonitor
  causing them to not show up in the /logs
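
A rough illustration of the first fix follows; it is not the actual llama-swap code, and the names (Process, stateReady, logMonitor, the field layout) are assumptions for the sketch. The idea: the TTL ticker goroutine has to return as soon as the child process is no longer in the Ready state, otherwise it keeps firing the unload path every second, which matches the repeated "Unloading model ..." lines above.

// ttlsketch.go — illustrative only; names are assumptions, not the project's API.
package ttlsketch

import (
	"log"
	"sync"
	"time"
)

type procState int

const (
	stateReady procState = iota
	stateStopped
)

type Process struct {
	mu          sync.Mutex
	name        string
	state       procState
	lastRequest time.Time
	logMonitor  *log.Logger // stand-in for the real log monitor
}

// watchTTL unloads the model after ttl of idle time. The key part of the fix:
// bail out when the process is no longer Ready, so the ticker does not keep
// logging "Unloading model ..." after the child has already been terminated.
func (p *Process) watchTTL(ttl time.Duration) {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()

	for range ticker.C {
		p.mu.Lock()
		st, idle := p.state, time.Since(p.lastRequest)
		p.mu.Unlock()

		if st != stateReady {
			return // child already gone; without this check the loop spins forever
		}
		if idle >= ttl {
			p.logMonitor.Printf("!!! Unloading model %s, TTL of %s reached.", p.name, ttl)
			p.stop() // would send SIGTERM to the child llama-server
			return
		}
	}
}

func (p *Process) stop() {
	p.mu.Lock()
	p.state = stateStopped
	p.mu.Unlock()
}
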
@mostlygeek mostlygeek self-assigned this Dec 16, 2024
@mostlygeek mostlygeek added the bug Something isn't working label Dec 16, 2024
@mostlygeek
Owner

I was able to reproduce the missing logs and the repeated logs. Thanks for reporting it.

Can you try out the improve-stop-exceptions branch and let me know if that fixes things for you?

@tommilatti
Author

Awesome! Now the unloading seems to work like before. I noticed the remote logs can show garbled output in the browser while the model is loading, but it fixes itself with F5 once the model has finished loading.

It looked like this before the reload:

srv    load_model: loading draft model '/home/user/laama/models/qwen2.5-coder/Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf'
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 3090) - 1399 MiB free
llama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = qwen2
a_model_loader: - kv  23:                       tokenizer.ggml.model str              = gpt2
4

llama_model_loader: - kv  15:                       qwen2.context_length u32              = 32768
base_model.count u32              = 1
llama_model_loader: - kv   9:                  general.base_model.0.name str              = Qwen2.5 Coder 0.5B
llama_model_loader: - kv  10:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  11:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen2.5-C...
llama_model_loader: - kv  12:                               general.tags arr[str,6]       = ["code", "codeqwen", "chat", "qwen", ...
r
llama_model_loader: - kv   5:                         general.size_label str              = 0.5B
llama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = qwen2
a_model_loader: - kv  23:                       tokenizer.ggml.model str              = gpt2
4

llama_model_loader: - kv  15:                       qwen2.context_length u32              = 32768
base_model.count u32              = 1
llama_model_loader: - kv   9:                  general.base_model.0.name str              = Qwen2.5 Coder 0.5B
llama_model_loader: - kv  10:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  11:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen2.5-C...
llama_model_loader: - kv  12:                               general.tags arr[str,6]       = ["code", "codeqwen", "chat", "qwen", ...
llama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = qwen2
a_model_loader: - kv  23:                       tokenizer.ggml.model str              = gpt2
4

llama_model_loader: - kv  15:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = qwen2
a_model_loader: - kv  23:                       tokenizer.ggml.model str              = gpt2
4
llama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = qwellama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = qwllama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = qwen2
a_model_lllama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = qwen2
a_mllama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = qwen2
a_model_loader: - kv  23:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  25:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  26:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  27:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  28:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  29:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  30:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  31:               tokenizer.ggml.add_bos_token bool             = false

The same thing is in the previous txt files: if you look at those kv XX numbers, they don't show up in order. But this might be an issue with server CPU load while the models are loading?

Thanks for the quick fix!

mostlygeek added a commit that referenced this issue Dec 16, 2024
Stop Process TTL goroutine when process is not ready (#28)

- fix issue where the goroutine will continue even though the child
  process is no longer running and the Process' state is not Ready
- fix issue where some logs were going to stdout instead of p.logMonitor
  causing them to not show up in the /logs
- add units to unloading model message
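
A minimal sketch of the logging side of that fix follows, under the same caveat: LogMonitor and its methods are illustrative assumptions, not the actual project code. The idea is to give the child process a single locked writer as its stdout/stderr, so every line lands in the buffer that /logs streams from instead of going to the proxy's own stdout, and concurrent writers cannot interleave mid-line.

// logsketch.go — illustrative only; names are assumptions, not the project's API.
package logsketch

import (
	"bufio"
	"io"
	"sync"
)

type LogMonitor struct {
	mu   sync.Mutex
	sink io.Writer // e.g. a ring buffer that the /logs endpoint streams from
}

func New(sink io.Writer) *LogMonitor {
	return &LogMonitor{sink: sink}
}

// Write lets LogMonitor be used directly as cmd.Stdout / cmd.Stderr for the
// child llama-server process; the mutex keeps concurrent writes whole.
func (m *LogMonitor) Write(p []byte) (int, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.sink.Write(p)
}

// CopyLines pumps a child stream line by line, writing whole lines under the
// lock so stdout and stderr output cannot end up garbled together.
func (m *LogMonitor) CopyLines(r io.Reader) error {
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		line := append(sc.Bytes(), '\n')
		if _, err := m.Write(line); err != nil {
			return err
		}
	}
	return sc.Err()
}
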
@mostlygeek
Owner

mostlygeek commented Dec 17, 2024

Alright, I chased down that last logging bug. Things look good on my end. I pushed a new release, v76, which should be ready soon. All changes are in the main branch now. If it's not fixed for you, please reopen the issue.
