problems unloading the model #28

Closed
tommilatti opened this issue Dec 16, 2024 · 3 comments
Assignees: mostlygeek
Labels: bug (Something isn't working)

Comments

@tommilatti

Hi,

I noticed that after the latest changes there seems to be an issue unloading models when the TTL is reached. I set a short TTL of 10 seconds, tested it, and this is what I see:

request: POST /v1/chat/completions 127.0.0.1 200
!!! Unloading model qwen-coder-32b-q4-draft, TTL of 10 reached.
!!! process for qwen-coder-32b-q4-draft stopped with error > signal: terminated
!!! Unloading model qwen-coder-32b-q4-draft, TTL of 10 reached.
!!! Unloading model qwen-coder-32b-q4-draft, TTL of 10 reached.
!!! Unloading model qwen-coder-32b-q4-draft, TTL of 10 reached.
!!! Unloading model qwen-coder-32b-q4-draft, TTL of 10 reached.
!!! Unloading model qwen-coder-32b-q4-draft, TTL of 10 reached.
!!! Unloading model qwen-coder-32b-q4-draft, TTL of 10 reached.
!!! Unloading model qwen-coder-32b-q4-draft, TTL of 10 reached.
!!! Unloading model qwen-coder-32b-q4-draft, TTL of 10 reached.
!!! Unloading model qwen-coder-32b-q4-draft, TTL of 10 reached.
!!! Unloading model qwen-coder-32b-q4-draft, TTL of 10 reached.
!!! Unloading model qwen-coder-32b-q4-draft, TTL of 10 reached.

The above is from the process stdout. If I monitor the logs remotely with the browser, the line "!!! process for qwen-coder-32b-q4-draft stopped with error > signal: terminated" does not appear:

request: POST /v1/chat/completions 127.0.0.1 200
!!! Unloading model qwen-coder-32b-q4-draft, TTL of 10 reached.
!!! Unloading model qwen-coder-32b-q4-draft, TTL of 10 reached.
!!! Unloading model qwen-coder-32b-q4-draft, TTL of 10 reached.
!!! Unloading model qwen-coder-32b-q4-draft, TTL of 10 reached.
!!! Unloading model qwen-coder-32b-q4-draft, TTL of 10 reached.
!!! Unloading model qwen-coder-32b-q4-draft, TTL of 10 reached.
!!! Unloading model qwen-coder-32b-q4-draft, TTL of 10 reached.
!!! Unloading model qwen-coder-32b-q4-draft, TTL of 10 reached.
!!! Unloading model qwen-coder-32b-q4-draft, TTL of 10 reached.
!!! Unloading model qwen-coder-32b-q4-draft, TTL of 10 reached.
!!! Unloading model qwen-coder-32b-q4-draft, TTL of 10 reached.
!!! Unloading model qwen-coder-32b-q4-draft, TTL of 10 reached.

The model was actually unloaded this time, but the above line keeps repeating at roughly one line per second indefinitely. I have also seen a case where the model was not unloaded and the line repeated like before, but I don't have any more info on that at the moment.

The version I'm using is:
version: local_d6ca535 (d6ca535), built at 2024-12-16T12:45:54Z

There also seem to be some differences between the stdout output and the remote log monitoring. I will attach web.txt and stdout.txt so you can diff them if you want; the above can be seen in them.

stdout.txt
web.txt

This is my config:

healthCheckTimeout: 60
models:
  "qwen-coder-32b-q4":
    cmd: >
      /home/user/llama.cpp/build/bin/llama-server
      --port 9503
      -ngl 99
      --flash-attn --cache-type-k q8_0 --cache-type-v q8_0
      --samplers "temperature;top_k;top_p"
      --temp 0.1
      --model /home/user/laama/models/qwen2.5-coder/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf
      --ctx-size 32000
    checkEndpoint: /health
    ttl: 180
    proxy: "http://127.0.0.1:9503"
  "qwen-coder-32b-q4-draft":
    cmd: >
      /home/user/llama.cpp/build/bin/llama-server
      --port 9503
      --flash-attn --cache-type-k q8_0 --cache-type-v q8_0
      --samplers "temperature;top_k;top_p"
      --temp 0.1
      --model /home/user/laama/models/qwen2.5-coder/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf
      -ngl 99
      --ctx-size 27000
      --model-draft /home/user/laama/models/qwen2.5-coder/Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf
      -ngld 99
      --draft-max 16
      --draft-min 4
      --draft-p-min 0.4 
    checkEndpoint: /health
    ttl: 10
    proxy: "http://127.0.0.1:9503"

mostlygeek added a commit that referenced this issue Dec 16, 2024
- fix issue where the goroutine will continue even though the child
  process is no longer running and the Process' state is not Ready
- fix issue where some logs were going to stdout instead of p.logMonitor
  causing them to not show up in the /logs
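
A rough illustration of the first fix follows; it is not the actual llama-swap code, and the names (Process, stateReady, logMonitor, the field layout) are assumptions for the sketch. The idea: the TTL ticker goroutine has to return as soon as the child process is no longer in the Ready state, otherwise it keeps firing the unload path every second, which matches the repeated "Unloading model ..." lines above.

// ttlsketch.go — illustrative only; names are assumptions, not the project's API.
package ttlsketch

import (
	"log"
	"sync"
	"time"
)

type procState int

const (
	stateReady procState = iota
	stateStopped
)

type Process struct {
	mu          sync.Mutex
	name        string
	state       procState
	lastRequest time.Time
	logMonitor  *log.Logger // stand-in for the real log monitor
}

// watchTTL unloads the model after ttl of idle time. The key part of the fix:
// bail out when the process is no longer Ready, so the ticker does not keep
// logging "Unloading model ..." after the child has already been terminated.
func (p *Process) watchTTL(ttl time.Duration) {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()

	for range ticker.C {
		p.mu.Lock()
		st, idle := p.state, time.Since(p.lastRequest)
		p.mu.Unlock()

		if st != stateReady {
			return // child already gone; without this check the loop spins forever
		}
		if idle >= ttl {
			p.logMonitor.Printf("!!! Unloading model %s, TTL of %s reached.", p.name, ttl)
			p.stop() // would send SIGTERM to the child llama-server
			return
		}
	}
}

func (p *Process) stop() {
	p.mu.Lock()
	p.state = stateStopped
	p.mu.Unlock()
}
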
@mostlygeek mostlygeek self-assigned this Dec 16, 2024
@mostlygeek mostlygeek added the bug Something isn't working label Dec 16, 2024
@mostlygeek
Owner

I was able to reproduce the missing logs and the repeated logs. Thanks for reporting it.

Can you try out the improve-stop-exceptions branch and let me know if that fixes things for you?

@tommilatti
Author

Awesome! Now the unloading seems to work like before. I noticed the remote logs can show garbled output in the browser while the model is loading, but it fixes itself with F5 once the model has finished loading.

It looked like this before the reload:

srv    load_model: loading draft model '/home/user/laama/models/qwen2.5-coder/Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf'
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 3090) - 1399 MiB free
llama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = qwen2
a_model_loader: - kv  23:                       tokenizer.ggml.model str              = gpt2
4

llama_model_loader: - kv  15:                       qwen2.context_length u32              = 32768
base_model.count u32              = 1
llama_model_loader: - kv   9:                  general.base_model.0.name str              = Qwen2.5 Coder 0.5B
llama_model_loader: - kv  10:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  11:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen2.5-C...
llama_model_loader: - kv  12:                               general.tags arr[str,6]       = ["code", "codeqwen", "chat", "qwen", ...
r
llama_model_loader: - kv   5:                         general.size_label str              = 0.5B
llama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = qwen2
a_model_loader: - kv  23:                       tokenizer.ggml.model str              = gpt2
4

llama_model_loader: - kv  15:                       qwen2.context_length u32              = 32768
base_model.count u32              = 1
llama_model_loader: - kv   9:                  general.base_model.0.name str              = Qwen2.5 Coder 0.5B
llama_model_loader: - kv  10:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  11:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen2.5-C...
llama_model_loader: - kv  12:                               general.tags arr[str,6]       = ["code", "codeqwen", "chat", "qwen", ...
llama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = qwen2
a_model_loader: - kv  23:                       tokenizer.ggml.model str              = gpt2
4

llama_model_loader: - kv  15:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = qwen2
a_model_loader: - kv  23:                       tokenizer.ggml.model str              = gpt2
4
llama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = qwellama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = qwllama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = qwen2
a_model_lllama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = qwen2
a_mllama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = qwen2
a_model_loader: - kv  23:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  25:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  26:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  27:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  28:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  29:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  30:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  31:               tokenizer.ggml.add_bos_token bool             = false

The same thing is in the previous txt files: if you look at those kv XX numbers, they don't show up in order. But this might be an issue with server CPU load while the models are loading?

Thanks for the quick fix!

mostlygeek added a commit that referenced this issue Dec 16, 2024
Stop Process TTL goroutine when process is not ready (#28)

- fix issue where the goroutine will continue even though the child
  process is no longer running and the Process' state is not Ready
- fix issue where some logs were going to stdout instead of p.logMonitor
  causing them to not show up in the /logs
- add units to unloading model message
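
A minimal sketch of the logging side of that fix follows, under the same caveat: LogMonitor and its methods are illustrative assumptions, not the actual project code. The idea is to give the child process a single locked writer as its stdout/stderr, so every line lands in the buffer that /logs streams from instead of going to the proxy's own stdout, and concurrent writers cannot interleave mid-line.

// logsketch.go — illustrative only; names are assumptions, not the project's API.
package logsketch

import (
	"bufio"
	"io"
	"sync"
)

type LogMonitor struct {
	mu   sync.Mutex
	sink io.Writer // e.g. a ring buffer that the /logs endpoint streams from
}

func New(sink io.Writer) *LogMonitor {
	return &LogMonitor{sink: sink}
}

// Write lets LogMonitor be used directly as cmd.Stdout / cmd.Stderr for the
// child llama-server process; the mutex keeps concurrent writes whole.
func (m *LogMonitor) Write(p []byte) (int, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.sink.Write(p)
}

// CopyLines pumps a child stream line by line, writing whole lines under the
// lock so stdout and stderr output cannot end up garbled together.
func (m *LogMonitor) CopyLines(r io.Reader) error {
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		line := append(sc.Bytes(), '\n')
		if _, err := m.Write(line); err != nil {
			return err
		}
	}
	return sc.Err()
}
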
@mostlygeek
Owner

mostlygeek commented Dec 17, 2024

Alright, I chased down that last logging bug. Things look good on my end. I pushed a new release, v76, which should be ready soon. All changes are in the main branch now. If it's not fixed for you, please reopen the issue.
