Cannot get Tabby 0.13.1 or 0.14.0 to work by following the quick-start guide #2719

yourchanges opened this issue Jul 24, 2024 · 4 comments
Describe the bug
Following the quick-start guide, Tabby 0.13.1 and 0.14.0 fail to come up: startup only launches the embedding model process,

/opt/tabby/bin/llama-server -m /data/models/TabbyML/Nomic-Embed-Text/ggml/model.gguf --cont-batching --port 30888 -np 1 --log-disable --ctx-size 4096 -ngl 9999 --embedding --ubatch-size 4096

and then hangs forever.
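
Since the supervisor log below only reports that llama-server exited with status code -1, one way to surface the real error is to run the same command by hand inside the container and read its output directly (a hedged debugging sketch, not from the original report; the container name tabbyserver4 comes from the docker run below, and --log-disable is dropped so the server actually prints its logs):

# Hypothetical debugging step: run the spawned llama-server manually so its
# stderr is visible; --log-disable is omitted to keep logging enabled.
docker exec -it tabbyserver4 /opt/tabby/bin/llama-server \
  -m /data/models/TabbyML/Nomic-Embed-Text/ggml/model.gguf \
  --cont-batching --port 30888 -np 1 --ctx-size 4096 \
  -ngl 9999 --embedding --ubatch-size 4096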

docker run -it --name tabbyserver4 --restart=unless-stopped --gpus '"device=0"' -p 8082:8080    -v /data/tabby:/data tabbyml/tabby serve --model StarCoder-1B --chat-model Qwen2-1.5B-Instruct --device cuda
Writing to new file.
🎯 Downloaded https://huggingface.co/TabbyML/models/resolve/main/starcoderbase-1B.Q8_0.gguf to /data/models/TabbyML/StarCoder-1B/ggml/model.gguf.tmp
   00:03:02 ▕████████████████████▏ 1.23 GiB/1.23 GiB  6.88 MiB/s  ETA 0s. ✅ Checksum OK.
Writing to new file.
🎯 Downloaded https://huggingface.co/Qwen/Qwen2-1.5B-Instruct-GGUF/resolve/main/qwen2-1_5b-instruct-q8_0.gguf to /data/models/TabbyML/Qwen2-1.5B-Instruct/ggml/model.gguf.tmp
   00:03:37 ▕████████████████████▏ 1.53 GiB/1.53 GiB  7.22 MiB/s  ETA 0s. ✅ Checksum OK.
⠋  2173.060 s	Starting...2024-07-24T07:25:27.218916Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:99: llama-server <embedding> exited with status code -1
2024-07-24T07:25:27.218935Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: loaded meta data with 23 key-value pairs and 112 tensors from /data/models/TabbyML/Nomic-Embed-Text/ggml/model.gguf (version GGUF V3 (latest))
2024-07-24T07:25:27.218940Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
2024-07-24T07:25:27.218943Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv   0:                       general.architecture str              = nomic-bert
2024-07-24T07:25:27.218946Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv   1:                               general.name str              = nomic-embed-text-v1.5
2024-07-24T07:25:27.218950Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv   2:                     nomic-bert.block_count u32              = 12
2024-07-24T07:25:27.218953Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv   3:                  nomic-bert.context_length u32              = 2048
2024-07-24T07:25:27.218960Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv   4:                nomic-bert.embedding_length u32              = 768
2024-07-24T07:25:27.218962Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv   5:             nomic-bert.feed_forward_length u32              = 3072
2024-07-24T07:25:27.218964Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv   6:            nomic-bert.attention.head_count u32              = 12
2024-07-24T07:25:27.218965Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv   7:    nomic-bert.attention.layer_norm_epsilon f32              = 0.000000
2024-07-24T07:25:27.218968Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv   8:                          general.file_type u32              = 7
2024-07-24T07:25:27.218971Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv   9:                nomic-bert.attention.causal bool             = false
2024-07-24T07:25:27.218974Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv  10:                    nomic-bert.pooling_type u32              = 1
2024-07-24T07:25:27.218982Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv  11:                  nomic-bert.rope.freq_base f32              = 1000.000000
2024-07-24T07:25:27.218983Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv  12:            tokenizer.ggml.token_type_count u32              = 2
2024-07-24T07:25:27.218985Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv  13:                tokenizer.ggml.bos_token_id u32              = 101
2024-07-24T07:25:27.218986Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv  14:                tokenizer.ggml.eos_token_id u32              = 102
2024-07-24T07:25:27.218988Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = bert
2024-07-24T07:25:27.218991Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,30522]   = ["[PAD]", "[unused0]", "[unused1]", "...
2024-07-24T07:25:27.218992Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv  17:                      tokenizer.ggml.scores arr[f32,30522]   = [-1000.000000, -1000.000000, -1000.00...
2024-07-24T07:25:27.218994Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,30522]   = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
2024-07-24T07:25:27.218996Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv  19:            tokenizer.ggml.unknown_token_id u32              = 100
2024-07-24T07:25:27.218999Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv  20:          tokenizer.ggml.seperator_token_id u32              = 102
2024-07-24T07:25:27.219005Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 0
2024-07-24T07:25:27.219009Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - kv  22:               general.quantization_version u32              = 2
2024-07-24T07:25:27.219012Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - type  f32:   51 tensors
2024-07-24T07:25:27.219016Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llama_model_loader: - type q8_0:   61 tensors
2024-07-24T07:25:27.219021Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_vocab: special tokens cache size = 5
2024-07-24T07:25:27.219026Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_vocab: token to piece cache size = 0.2032 MB
2024-07-24T07:25:27.219031Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: format           = GGUF V3 (latest)
2024-07-24T07:25:27.219036Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: arch             = nomic-bert
2024-07-24T07:25:27.219041Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: vocab type       = WPM
2024-07-24T07:25:27.219047Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: n_vocab          = 30522
2024-07-24T07:25:27.219051Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: n_merges         = 0
2024-07-24T07:25:27.219056Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: vocab_only       = 0
2024-07-24T07:25:27.219064Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: n_ctx_train      = 2048
2024-07-24T07:25:27.219071Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: n_embd           = 768
2024-07-24T07:25:27.219078Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: n_layer          = 12
2024-07-24T07:25:27.219084Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: n_head           = 12
2024-07-24T07:25:27.219091Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: n_head_kv        = 12
2024-07-24T07:25:27.219099Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: n_rot            = 64
2024-07-24T07:25:27.219105Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: n_swa            = 0
2024-07-24T07:25:27.219111Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: n_embd_head_k    = 64
2024-07-24T07:25:27.219118Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: n_embd_head_v    = 64
2024-07-24T07:25:27.219125Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: n_gqa            = 1
2024-07-24T07:25:27.219133Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: n_embd_k_gqa     = 768
2024-07-24T07:25:27.219139Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: n_embd_v_gqa     = 768
2024-07-24T07:25:27.219143Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: f_norm_eps       = 1.0e-12
2024-07-24T07:25:27.219149Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
2024-07-24T07:25:27.219157Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
2024-07-24T07:25:27.219176Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
2024-07-24T07:25:27.219186Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: f_logit_scale    = 0.0e+00
2024-07-24T07:25:27.219193Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: n_ff             = 3072
2024-07-24T07:25:27.219210Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: n_expert         = 0
2024-07-24T07:25:27.219218Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: n_expert_used    = 0
2024-07-24T07:25:27.219224Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: causal attn      = 0
2024-07-24T07:25:27.219251Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: pooling type     = 1
2024-07-24T07:25:27.219254Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: rope type        = 2
2024-07-24T07:25:27.219257Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: rope scaling     = linear
2024-07-24T07:25:27.219260Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: freq_base_train  = 1000.0
2024-07-24T07:25:27.219265Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: freq_scale_train = 1
2024-07-24T07:25:27.219270Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: n_ctx_orig_yarn  = 2048
2024-07-24T07:25:27.219275Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: rope_finetuned   = unknown
2024-07-24T07:25:27.219280Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: ssm_d_conv       = 0
2024-07-24T07:25:27.219286Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: ssm_d_inner      = 0
2024-07-24T07:25:27.219291Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: ssm_d_state      = 0
2024-07-24T07:25:27.219298Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: ssm_dt_rank      = 0
2024-07-24T07:25:27.219304Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: model type       = 137M
2024-07-24T07:25:27.219309Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: model ftype      = Q8_0
2024-07-24T07:25:27.219315Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: model params     = 136.73 M
2024-07-24T07:25:27.219329Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: model size       = 138.65 MiB (8.51 BPW)
2024-07-24T07:25:27.219334Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: general.name     = nomic-embed-text-v1.5
2024-07-24T07:25:27.219343Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: BOS token        = 101 '[CLS]'
2024-07-24T07:25:27.219347Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: EOS token        = 102 '[SEP]'
2024-07-24T07:25:27.219352Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: UNK token        = 100 '[UNK]'
2024-07-24T07:25:27.219362Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: SEP token        = 102 '[SEP]'
2024-07-24T07:25:27.219365Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: PAD token        = 0 '[PAD]'
2024-07-24T07:25:27.219371Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: CLS token        = 101 '[CLS]'
2024-07-24T07:25:27.219374Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: MASK token       = 103 '[MASK]'
2024-07-24T07:25:27.219376Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: LF token         = 0 '[PAD]'
2024-07-24T07:25:27.219378Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: llm_load_print_meta: max token length = 21
2024-07-24T07:25:27.219381Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
2024-07-24T07:25:27.219387Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
2024-07-24T07:25:27.219390Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: ggml_cuda_init: found 1 CUDA devices:
2024-07-24T07:25:27.219392Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>:   Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
⠸  2174.102 s	Starting...^C2024-07-24T07:25:28.289106Z  WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:99: llama-server <embedding> exited with status code -1
[... the identical llama-server model-loading log repeats as the supervisor restarts the embedding server ...]
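
While Tabby sits at "Starting...", the spawned embedding server can also be probed directly (a hedged check: it assumes curl is available in the image and that this llama-server build exposes llama.cpp's /health endpoint on the port shown in the command above):

# Hypothetical probe: ask the embedding llama-server whether it is up.
# Connection refused here is consistent with the crash loop logged above.
docker exec -it tabbyserver4 curl -s http://127.0.0.1:30888/health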

Information about your version
0.13.1 and 0.14.0

Information about your GPU

Wed Jul 24 15:30:02 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3080        On  | 00000000:01:00.0 Off |                  N/A |
| 50%   43C    P8              19W / 320W |     29MiB / 20480MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1995      G   /usr/libexec/Xorg                            12MiB |
|    0   N/A  N/A      3008      G   gnome-shell                                   4MiB |
|    0   N/A  N/A      3923      G   /usr/libexec/gnome-initial-setup              3MiB |
+---------------------------------------------------------------------------------------+
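
A quick sanity check (an assumed first step, not something run in the report) is to confirm the NVIDIA container runtime can expose this GPU with the same --gpus selector; the CUDA base image tag here is illustrative:

# Hypothetical sanity check: if this prints the same nvidia-smi table,
# GPU passthrough itself is working and the problem is inside tabby.
docker run --rm --gpus '"device=0"' nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi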
kba-tmn3 commented Aug 7, 2024

I have the same issue. How can I troubleshoot it?

kitswas commented Aug 15, 2024

Same here. Running with:

docker run -it --gpus all   -p 8080:8080 -v $HOME/.tabby:/data   tabbyml/tabby serve --model StarCoder-1B --chat-model Qwen2-1.5B-Instruct --device cuda

GPU info:

Thu Aug 15 10:11:51 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1650        Off | 00000000:01:00.0 Off |                  N/A |
| N/A   44C    P0               6W /  50W |      3MiB /  4096MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2642      G   /usr/bin/gnome-shell                          1MiB |
+---------------------------------------------------------------------------------------+
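
On a 4 GiB card, one isolation step (a hedged suggestion, not from the thread; --device cpu is a documented tabby serve option) is to rule out GPU memory pressure by serving on the CPU:

# Hypothetical isolation test: if this starts cleanly, the failure is
# specific to the CUDA path rather than to the models or the image.
docker run -it -p 8080:8080 -v $HOME/.tabby:/data \
  tabbyml/tabby serve --model StarCoder-1B --chat-model Qwen2-1.5B-Instruct --device cpu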

MaxenceBouvier commented Aug 20, 2024

Same issue here as well.
Going back to Tabby v0.12.0 seems to work for me
(when serving CodeGemma-7B, without the webserver).

wsxiaoys (Member) commented Aug 20, 2024

Thank you for reporting the issues. The changes in https://github.com/TabbyML/tabby/pull/2925/files will be included in the 0.16 release and will provide more detailed information in the logs to assist with debugging.
