ENH: multiple GPU support for llama.cpp engine #1202

Closed
FalconIA opened this issue Mar 28, 2024 · 5 comments
Labels: enhancement (New feature or request), gpu

FalconIA commented Mar 28, 2024

Is your feature request related to a problem? Please describe

When launching a GGUF model, only one GPU is ever used:

xinference  | 2024-03-28 01:34:02,909 xinference.core.worker 202 DEBUG    Enter launch_builtin_model, args: (<xinference.core.worker.WorkerActor object at 0x7f3831ffda30>,), kwargs: {'model_uid': 'qwen1.5-72b-q3-1-0', 'model_name': 'qwen1.5-chat-offline', 'model_size_in_billions': 72, 'model_format': 'ggufv2', 'quantization': 'q3_k_m', 'model_type': 'LLM', 'n_gpu': 'auto', 'request_limits': None, 'peft_model_path': None, 'image_lora_load_kwargs': None, 'image_lora_fuse_kwargs': None}
xinference  | 2024-03-28 01:34:02,910 xinference.core.worker 202 DEBUG    GPU selected: [0] for model qwen1.5-72b-q3-1-0
xinference  | 2024-03-28 01:34:17,402 xinference.model.llm.llm_family 202 INFO     Caching from URI: file:///opt/models/llm-gguf/Qwen/Qwen1.5-72B-Chat-GGUF
xinference  | 2024-03-28 01:34:17,411 xinference.model.llm.llm_family 202 INFO     Cache /opt/models/llm-gguf/Qwen/Qwen1.5-72B-Chat-GGUF exists
xinference  | 2024-03-28 01:34:17,412 xinference.model.llm.core 202 DEBUG    Launching qwen1.5-72b-q3-1-0 with LlamaCppChatModel
xinference  | ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
xinference  | ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
xinference  | ggml_init_cublas: found 1 CUDA devices:
xinference  |   Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
xinference  | llama_model_loader: loaded meta data with 21 key-value pairs and 963 tensors from /opt/models/llm-gguf/Qwen/Qwen1.5-72B-Chat-GGUF/qwen1_5-72b-chat-q3_k_m.gguf (version GGUF V3 (latest))
xinference  | llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
xinference  | llama_model_loader: - kv   0:                       general.architecture str              = qwen2
xinference  | llama_model_loader: - kv   1:                               general.name str              = Qwen1.5-72B-Chat-AWQ-fp16
xinference  | llama_model_loader: - kv   2:                          qwen2.block_count u32              = 80
xinference  | llama_model_loader: - kv   3:                       qwen2.context_length u32              = 32768
xinference  | llama_model_loader: - kv   4:                     qwen2.embedding_length u32              = 8192
xinference  | llama_model_loader: - kv   5:                  qwen2.feed_forward_length u32              = 24576
xinference  | llama_model_loader: - kv   6:                 qwen2.attention.head_count u32              = 64
xinference  | llama_model_loader: - kv   7:              qwen2.attention.head_count_kv u32              = 64
xinference  | llama_model_loader: - kv   8:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
xinference  | llama_model_loader: - kv   9:                       qwen2.rope.freq_base f32              = 1000000.000000
xinference  | llama_model_loader: - kv  10:                qwen2.use_parallel_residual bool             = true
xinference  | llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = gpt2
xinference  | llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
xinference  | llama_model_loader: - kv  13:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
xinference  | llama_model_loader: - kv  14:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
xinference  | llama_model_loader: - kv  15:                tokenizer.ggml.eos_token_id u32              = 151645
xinference  | llama_model_loader: - kv  16:            tokenizer.ggml.padding_token_id u32              = 151643
xinference  | llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 151643
xinference  | llama_model_loader: - kv  18:                    tokenizer.chat_template str              = {% for message in messages %}{{'<|im_...
xinference  | llama_model_loader: - kv  19:               general.quantization_version u32              = 2
xinference  | llama_model_loader: - kv  20:                          general.file_type u32              = 12
xinference  | llama_model_loader: - type  f32:  401 tensors
xinference  | llama_model_loader: - type q3_K:  321 tensors
xinference  | llama_model_loader: - type q4_K:  155 tensors
xinference  | llama_model_loader: - type q5_K:   85 tensors
xinference  | llama_model_loader: - type q6_K:    1 tensors
xinference  | llm_load_vocab: special tokens definition check successful ( 421/152064 ).
xinference  | llm_load_print_meta: format           = GGUF V3 (latest)
xinference  | llm_load_print_meta: arch             = qwen2
xinference  | llm_load_print_meta: vocab type       = BPE
xinference  | llm_load_print_meta: n_vocab          = 152064
xinference  | llm_load_print_meta: n_merges         = 151387
xinference  | llm_load_print_meta: n_ctx_train      = 32768
xinference  | llm_load_print_meta: n_embd           = 8192
xinference  | llm_load_print_meta: n_head           = 64
xinference  | llm_load_print_meta: n_head_kv        = 64
xinference  | llm_load_print_meta: n_layer          = 80
xinference  | llm_load_print_meta: n_rot            = 128
xinference  | llm_load_print_meta: n_embd_head_k    = 128
xinference  | llm_load_print_meta: n_embd_head_v    = 128
xinference  | llm_load_print_meta: n_gqa            = 1
xinference  | llm_load_print_meta: n_embd_k_gqa     = 8192
xinference  | llm_load_print_meta: n_embd_v_gqa     = 8192
xinference  | llm_load_print_meta: f_norm_eps       = 0.0e+00
xinference  | llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
xinference  | llm_load_print_meta: f_clamp_kqv      = 0.0e+00
xinference  | llm_load_print_meta: f_max_alibi_bias = 0.0e+00
xinference  | llm_load_print_meta: n_ff             = 24576
xinference  | llm_load_print_meta: n_expert         = 0
xinference  | llm_load_print_meta: n_expert_used    = 0
xinference  | llm_load_print_meta: pooling type     = 0
xinference  | llm_load_print_meta: rope type        = 2
xinference  | llm_load_print_meta: rope scaling     = linear
xinference  | llm_load_print_meta: freq_base_train  = 1000000.0
xinference  | llm_load_print_meta: freq_scale_train = 1
xinference  | llm_load_print_meta: n_yarn_orig_ctx  = 32768
xinference  | llm_load_print_meta: rope_finetuned   = unknown
xinference  | llm_load_print_meta: ssm_d_conv       = 0
xinference  | llm_load_print_meta: ssm_d_inner      = 0
xinference  | llm_load_print_meta: ssm_d_state      = 0
xinference  | llm_load_print_meta: ssm_dt_rank      = 0
xinference  | llm_load_print_meta: model type       = 70B
xinference  | llm_load_print_meta: model ftype      = Q3_K - Medium
xinference  | llm_load_print_meta: model params     = 72.29 B
xinference  | llm_load_print_meta: model size       = 33.45 GiB (3.98 BPW)
xinference  | llm_load_print_meta: general.name     = Qwen1.5-72B-Chat-AWQ-fp16
xinference  | llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
xinference  | llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
xinference  | llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
xinference  | llm_load_print_meta: LF token         = 148848 'ÄĬ'
xinference  | llm_load_tensors: ggml ctx size =    0.74 MiB
xinference  | llm_load_tensors: offloading 80 repeating layers to GPU
xinference  | llm_load_tensors: offloading non-repeating layers to GPU
xinference  | llm_load_tensors: offloaded 81/81 layers to GPU
xinference  | llm_load_tensors:  CUDA_Host buffer size =   510.47 MiB
xinference  | llm_load_tensors:      CUDA0 buffer size = 33747.06 MiB
xinference  | ....................................2024-03-28 01:35:30,477 xinference.core.supervisor 202 DEBUG    Enter launch_builtin_model, model_uid: qwen1.5-72b-q3, model_name: qwen1.5-chat-offline, model_size: 72, model_format: ggufv2, quantization: q3_k_m, replica: 1
xinference  | 2024-03-28 01:35:30,480 xinference.api.restful_api 1 ERROR    [address=0.0.0.0:53955, pid=202] Model is already in the model list, uid: qwen1.5-72b-q3
xinference  | Traceback (most recent call last):
xinference  |   File "/opt/conda/lib/python3.10/site-packages/xinference/api/restful_api.py", line 722, in launch_model
xinference  |     model_uid = await (await self._get_supervisor_ref()).launch_builtin_model(
xinference  |   File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/context.py", line 227, in send
xinference  |     return self._process_result_message(result)
xinference  |   File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
xinference  |     raise message.as_instanceof_cause()
xinference  |   File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/pool.py", line 659, in send
xinference  |     result = await self._run_coro(message.message_id, coro)
xinference  |   File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
xinference  |     return await coro
xinference  |   File "/opt/conda/lib/python3.10/site-packages/xoscar/api.py", line 384, in __on_receive__
xinference  |     return await super().__on_receive__(message)  # type: ignore
xinference  |   File "xoscar/core.pyx", line 558, in __on_receive__
xinference  |     raise ex
xinference  |   File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
xinference  |     async with self._lock:
xinference  |   File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
xinference  |     with debug_async_timeout('actor_lock_timeout',
xinference  |   File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
xinference  |     result = await result
xinference  |   File "/opt/conda/lib/python3.10/site-packages/xinference/core/supervisor.py", line 780, in launch_builtin_model
xinference  |     raise ValueError(f"Model is already in the model list, uid: {model_uid}")
xinference  | ValueError: [address=0.0.0.0:53955, pid=202] Model is already in the model list, uid: qwen1.5-72b-q3
xinference  | ..............................................................
xinference  | llama_new_context_with_model: n_ctx      = 32768
xinference  | llama_new_context_with_model: freq_base  = 1000000.0
xinference  | llama_new_context_with_model: freq_scale = 1
xinference  | 2024-03-28 01:37:54,732 xinference.core.supervisor 202 DEBUG    Enter list_model_registrations, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f3831ffda80>, 'LLM'), kwargs: {'detailed': True}
xinference  | 2024-03-28 01:37:56,346 xinference.core.supervisor 202 DEBUG    Leave list_model_registrations, elapsed time: 1 s
xinference  | 2024-03-28 01:37:58,053 xinference.core.supervisor 202 DEBUG    Enter list_model_registrations, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f3831ffda80>, 'rerank'), kwargs: {'detailed': False}
xinference  | 2024-03-28 01:37:58,055 xinference.core.supervisor 202 DEBUG    Leave list_model_registrations, elapsed time: 0 s
xinference  | 2024-03-28 01:37:58,109 xinference.core.supervisor 202 DEBUG    Enter list_model_registrations, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f3831ffda80>, 'embedding'), kwargs: {'detailed': False}
xinference  | 2024-03-28 01:37:58,111 xinference.core.supervisor 202 DEBUG    Leave list_model_registrations, elapsed time: 0 s
xinference  | 2024-03-28 01:37:58,130 xinference.core.supervisor 202 DEBUG    Enter list_model_registrations, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f3831ffda80>, 'LLM'), kwargs: {'detailed': False}
xinference  | 2024-03-28 01:37:58,132 xinference.core.supervisor 202 DEBUG    Leave list_model_registrations, elapsed time: 0 s
xinference  | 2024-03-28 01:37:58,150 xinference.core.supervisor 202 DEBUG    Enter get_model_registration, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f3831ffda80>, 'LLM', 'chatglm2-offline'), kwargs: {}
xinference  | 2024-03-28 01:37:58,151 xinference.core.supervisor 202 DEBUG    Leave get_model_registration, elapsed time: 0 s
xinference  | 2024-03-28 01:37:58,155 xinference.core.supervisor 202 DEBUG    Enter get_model_registration, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f3831ffda80>, 'LLM', 'chatglm3-offline'), kwargs: {}
xinference  | 2024-03-28 01:37:58,157 xinference.core.supervisor 202 DEBUG    Leave get_model_registration, elapsed time: 0 s
xinference  | 2024-03-28 01:37:58,160 xinference.core.supervisor 202 DEBUG    Enter get_model_registration, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f3831ffda80>, 'LLM', 'qwen1.5-chat-offline'), kwargs: {}
xinference  | 2024-03-28 01:37:58,162 xinference.core.supervisor 202 DEBUG    Leave get_model_registration, elapsed time: 0 s
xinference  | 2024-03-28 01:38:01,173 xinference.core.supervisor 202 DEBUG    Enter unregister_model, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f3831ffda80>, 'LLM', 'qwen1.5-chat-offline'), kwargs: {}
xinference  | 2024-03-28 01:38:01,186 xinference.core.supervisor 202 DEBUG    Leave unregister_model, elapsed time: 0 s
xinference  | ggml_backend_cuda_buffer_type_alloc_buffer: allocating 81920.00 MiB on device 0: cudaMalloc failed: out of memory
xinference  | llama_kv_cache_init: failed to allocate buffer for kv cache
xinference  | llama_new_context_with_model: llama_kv_cache_init() failed for self-attention cache
xinference  | 2024-03-28 01:38:02,330 xinference.core.worker 202 ERROR    Failed to load model qwen1.5-72b-q3-1-0
xinference  | Traceback (most recent call last):
xinference  |   File "/opt/conda/lib/python3.10/site-packages/xinference/core/worker.py", line 569, in launch_builtin_model
xinference  |     await model_ref.load()
xinference  |   File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/context.py", line 227, in send
xinference  |     return self._process_result_message(result)
xinference  |   File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
xinference  |     raise message.as_instanceof_cause()
xinference  |   File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/pool.py", line 659, in send
xinference  |     result = await self._run_coro(message.message_id, coro)
xinference  |   File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
xinference  |     return await coro
xinference  |   File "/opt/conda/lib/python3.10/site-packages/xoscar/api.py", line 384, in __on_receive__
xinference  |     return await super().__on_receive__(message)  # type: ignore
xinference  |   File "xoscar/core.pyx", line 558, in __on_receive__
xinference  |     raise ex
xinference  |   File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
xinference  |     async with self._lock:
xinference  |   File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
xinference  |     with debug_async_timeout('actor_lock_timeout',
xinference  |   File "xoscar/core.pyx", line 524, in xoscar.core._BaseActor.__on_receive__
xinference  |     result = func(*args, **kwargs)
xinference  |   File "/opt/conda/lib/python3.10/site-packages/xinference/core/model.py", line 239, in load
xinference  |     self._model.load()
xinference  |   File "/opt/conda/lib/python3.10/site-packages/xinference/model/llm/ggml/llamacpp.py", line 171, in load
xinference  |     self._llm = Llama(
xinference  |   File "/opt/conda/lib/python3.10/site-packages/llama_cpp/llama.py", line 325, in __init__
xinference  |     self._ctx = _LlamaContext(
xinference  |   File "/opt/conda/lib/python3.10/site-packages/llama_cpp/_internals.py", line 265, in __init__
xinference  |     raise ValueError("Failed to create llama_context")
xinference  | ValueError: [address=0.0.0.0:38499, pid=258] Failed to create llama_context
xinference  | 2024-03-28 01:38:05,054 xinference.core.supervisor 202 DEBUG    Enter terminate_model, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f3831ffda80>, 'qwen1.5-72b-q3'), kwargs: {'suppress_exception': True}
xinference  | 2024-03-28 01:38:05,056 xinference.core.supervisor 202 DEBUG    Leave terminate_model, elapsed time: 0 s
xinference  | 2024-03-28 01:38:05,066 xinference.api.restful_api 1 ERROR    [address=0.0.0.0:38499, pid=258] Failed to create llama_context
xinference  | Traceback (most recent call last):
xinference  |   File "/opt/conda/lib/python3.10/site-packages/xinference/api/restful_api.py", line 722, in launch_model
xinference  |     model_uid = await (await self._get_supervisor_ref()).launch_builtin_model(
xinference  |   File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/context.py", line 227, in send
xinference  |     return self._process_result_message(result)
xinference  |   File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
xinference  |     raise message.as_instanceof_cause()
xinference  |   File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/pool.py", line 659, in send
xinference  |     result = await self._run_coro(message.message_id, coro)
xinference  |   File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
xinference  |     return await coro
xinference  |   File "/opt/conda/lib/python3.10/site-packages/xoscar/api.py", line 384, in __on_receive__
xinference  |     return await super().__on_receive__(message)  # type: ignore
xinference  |   File "xoscar/core.pyx", line 558, in __on_receive__
xinference  |     raise ex
xinference  |   File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
xinference  |     async with self._lock:
xinference  |   File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
xinference  |     with debug_async_timeout('actor_lock_timeout',
xinference  |   File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
xinference  |     result = await result
xinference  |   File "/opt/conda/lib/python3.10/site-packages/xinference/core/supervisor.py", line 796, in launch_builtin_model
xinference  |     await _launch_model()
xinference  |   File "/opt/conda/lib/python3.10/site-packages/xinference/core/supervisor.py", line 760, in _launch_model
xinference  |     await _launch_one_model(rep_model_uid)
xinference  |   File "/opt/conda/lib/python3.10/site-packages/xinference/core/supervisor.py", line 741, in _launch_one_model
xinference  |     await worker_ref.launch_builtin_model(
xinference  |   File "xoscar/core.pyx", line 284, in __pyx_actor_method_wrapper
xinference  |     async with lock:
xinference  |   File "xoscar/core.pyx", line 287, in xoscar.core.__pyx_actor_method_wrapper
xinference  |     result = await result
xinference  |   File "/opt/conda/lib/python3.10/site-packages/xinference/core/utils.py", line 45, in wrapped
xinference  |     ret = await func(*args, **kwargs)
xinference  |   File "/opt/conda/lib/python3.10/site-packages/xinference/core/worker.py", line 569, in launch_builtin_model
xinference  |     await model_ref.load()
xinference  |   File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/context.py", line 227, in send
xinference  |     return self._process_result_message(result)
xinference  |   File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
xinference  |     raise message.as_instanceof_cause()
xinference  |   File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/pool.py", line 659, in send
xinference  |     result = await self._run_coro(message.message_id, coro)
xinference  |   File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
xinference  |     return await coro
xinference  |   File "/opt/conda/lib/python3.10/site-packages/xoscar/api.py", line 384, in __on_receive__
xinference  |     return await super().__on_receive__(message)  # type: ignore
xinference  |   File "xoscar/core.pyx", line 558, in __on_receive__
xinference  |     raise ex
xinference  |   File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
xinference  |     async with self._lock:
xinference  |   File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
xinference  |     with debug_async_timeout('actor_lock_timeout',
xinference  |   File "xoscar/core.pyx", line 524, in xoscar.core._BaseActor.__on_receive__
xinference  |     result = func(*args, **kwargs)
xinference  |   File "/opt/conda/lib/python3.10/site-packages/xinference/core/model.py", line 239, in load
xinference  |     self._model.load()
xinference  |   File "/opt/conda/lib/python3.10/site-packages/xinference/model/llm/ggml/llamacpp.py", line 171, in load
xinference  |     self._llm = Llama(
xinference  |   File "/opt/conda/lib/python3.10/site-packages/llama_cpp/llama.py", line 325, in __init__
xinference  |     self._ctx = _LlamaContext(
xinference  |   File "/opt/conda/lib/python3.10/site-packages/llama_cpp/_internals.py", line 265, in __init__
xinference  |     raise ValueError("Failed to create llama_context")
xinference  | ValueError: [address=0.0.0.0:38499, pid=258] Failed to create llama_context

I tried changing the parameters:

curl 'http://localhost:9997/v1/models' \
  -H 'Content-Type: application/json' \
  -H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0' \
  --data-raw '{"model_uid":"qwen1.5-72b-q4","model_name":"qwen1.5-chat-offline","model_format":"ggufv2","model_size_in_billions":72,"quantization":"q4_k_m","n_gpu":"2","replica":1}'

It fails with an error; setting `n_gpu` is not supported:

{"detail":"[address=0.0.0.0:53955, pid=202] Currently `n_gpu` only supports `auto`."}
>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.cuda.device_count()
2

Describe the solution you'd like

  1. llama.cpp already supports multiple GPUs (Multi GPU support, CUDA refactor, CUDA scratch buffer ggerganov/llama.cpp#1703); see the sketch below.
  2. Tested with LM Studio: qwen-72b q4 loads fine on 2× RTX 3090.
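
For reference, the multi-GPU controls live in llama-cpp-python, which is the library Xinference calls in the traceback above. Below is a minimal sketch of loading the GGUF file from the log across two GPUs directly with llama-cpp-python; the split ratios and the reduced n_ctx are illustrative assumptions, not values taken from this issue.

# Sketch: split a GGUF model across two GPUs with llama-cpp-python (CUDA build assumed).
from llama_cpp import Llama

llm = Llama(
    model_path="/opt/models/llm-gguf/Qwen/Qwen1.5-72B-Chat-GGUF/qwen1_5-72b-chat-q3_k_m.gguf",
    n_gpu_layers=-1,          # offload all layers to GPU
    tensor_split=[0.5, 0.5],  # fraction of the model to place on each device
    main_gpu=0,               # primary device for scratch/intermediate buffers
    n_ctx=4096,               # smaller than the 32768 training context to keep the KV cache modest
)
print(llm("Hello", max_tokens=16)["choices"][0]["text"])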
XprobeBot added the gpu label Mar 28, 2024
XprobeBot added this to the v0.9.5 milestone Mar 28, 2024
qinxuye (Contributor) commented Mar 28, 2024

Thanks. Since llama.cpp already supports multiple GPUs, we will implement this as soon as possible.

qinxuye changed the title from "希望GGUF模型开放多GPU支持" ("please open up multi-GPU support for GGUF models") to "ENH: multiple GPU for llama.cpp engine" Mar 28, 2024
XprobeBot added the enhancement (New feature or request) label Mar 28, 2024
qinxuye changed the title from "ENH: multiple GPU for llama.cpp engine" to "ENH: multiple GPU support for llama.cpp engine" Mar 28, 2024
XprobeBot modified the milestones: v0.10.0, v0.10.1 Mar 29, 2024
amumu96 (Contributor) commented Apr 7, 2024

`n_gpu` accepts an int parameter; if you send a string, it raises an error. You can use it like this:

curl 'http://localhost:9997/v1/models' \
  -H 'Content-Type: application/json' \
  -H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0' \
  --data-raw '{"model_uid":"qwen1.5-72b-q4","model_name":"qwen1.5-chat-offline","model_format":"ggufv2","model_size_in_billions":72,"quantization":"q4_k_m","n_gpu":2,"replica":1}'

Then you can launch the model on 2 GPUs. @FalconIA
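
For completeness, the same launch can be issued from Python instead of curl. This is a rough sketch assuming Xinference's RESTful client mirrors the JSON payload above; the client class and keyword names have not been verified against this exact Xinference version.

# Sketch: launch the model on 2 GPUs via the Xinference Python client (assumed API).
from xinference.client import RESTfulClient

client = RESTfulClient("http://localhost:9997")
model_uid = client.launch_model(
    model_uid="qwen1.5-72b-q4",
    model_name="qwen1.5-chat-offline",
    model_format="ggufv2",
    model_size_in_billions=72,
    quantization="q4_k_m",
    n_gpu=2,      # pass an int, not the string "2"
    replica=1,
)
print(model_uid)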

ChengjieLi28 (Contributor) commented

Xinference supports this now. `n_gpu` needs to be an int value.

amumu96 (Contributor) commented Apr 9, 2024

I guess it was because your GPU doesn't have enough memory:

ggml_backend_cuda_buffer_type_alloc_buffer: allocating 41984.00 MiB on device 0: cudaMalloc failed: out of memory

You can try setting `n_gpu_layers` to choose how many layers are offloaded to the GPU. @FalconIA
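
As a rough sanity check of this diagnosis: plugging the values printed in the original log (n_ctx = 32768, n_layer = 80, n_embd_k_gqa = n_embd_v_gqa = 8192, f16 cache) into the usual KV-cache size formula gives exactly the 81920 MiB that cudaMalloc failed to allocate in the first run, far beyond a single 24 GiB RTX 3090. A quick back-of-the-envelope check:

# KV-cache size implied by the values in the log above.
n_ctx, n_layer = 32768, 80
n_embd_k_gqa = n_embd_v_gqa = 8192
bytes_per_elem = 2  # f16 KV cache
kv_bytes = n_ctx * n_layer * (n_embd_k_gqa + n_embd_v_gqa) * bytes_per_elem
print(kv_bytes / 1024**2)  # 81920.0 MiB, matching the failed allocation

Reducing n_ctx, or offloading fewer layers with n_gpu_layers, shrinks the per-GPU share of this footprint accordingly.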

FalconIA (Author) commented Apr 9, 2024

I guess it was because your GPU doesn't have enough memory:

ggml_backend_cuda_buffer_type_alloc_buffer: allocating 41984.00 MiB on device 0: cudaMalloc failed: out of memory

You can try setting `n_gpu_layers` to choose how many layers are offloaded to the GPU. @FalconIA

I tried another model (qwen1.5-32b q8_0). It works.
Thanks a lot.
