This is my .yaml configuration file:
```yaml
# Serve config file
# For documentation see:
# https://docs.ray.io/en/latest/serve/production-guide/config.html

host: 0.0.0.0
port: 8000

applications:

- name: demo_app
  route_prefix: /a
  import_path: ray_vllm_inference.vllm_serve:deployment
  runtime_env:
    env_vars:
      HUGGING_FACE_HUB_TOKEN: hf_1234
    pip:
      - ray_vllm_inference @ git+https://github.com//asprenger/ray_vllm_inference
  args:
    model: facebook/opt-13b
    tensor_parallel_size: 4
  deployments:
  - name: VLLMInference
    num_replicas: 1
    # Maximum backlog for a single replica
    max_concurrent_queries: 10
    ray_actor_options:
      num_gpus: 4

- name: demo_app2
  route_prefix: /b
  import_path: ray_vllm_inference.vllm_serve:deployment
  runtime_env:
    env_vars:
      HUGGING_FACE_HUB_TOKEN: hf_1234
    pip:
      - ray_vllm_inference @ git+https://github.com//asprenger/ray_vllm_inference
  args:
    model: facebook/opt-13b
    tensor_parallel_size: 4
  deployments:
  - name: VLLMInference
    num_replicas: 1
    # Maximum backlog for a single replica
    max_concurrent_queries: 10
    ray_actor_options:
      num_gpus: 4
```
I run it with `serve run config2.yaml`, but the deployment gets stuck and never completes. Here are the logs:
```
2024-01-11 12:58:28,970 INFO scripts.py:442 -- Running config file: 'config2.yaml'.
2024-01-11 12:58:30,870 INFO worker.py:1664 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
2024-01-11 12:58:33,757 SUCC scripts.py:543 -- Submitted deploy config successfully.
(ServeController pid=1450442) INFO 2024-01-11 12:58:33,752 controller 1450442 application_state.py:386 - Building application 'demo_app'.
(ServeController pid=1450442) INFO 2024-01-11 12:58:33,756 controller 1450442 application_state.py:386 - Building application 'demo_app2'.
(ProxyActor pid=1450530) INFO 2024-01-11 12:58:33,727 proxy 10.10.29.89 proxy.py:1072 - Proxy actor 4b0df404e3c5af4bd834d1ab01000000 starting on node b411128da157f5f64092128c212c0000973bfecd12b3e94b3d648495.
(ProxyActor pid=1450530) INFO 2024-01-11 12:58:33,732 proxy 10.10.29.89 proxy.py:1257 - Starting HTTP server on node: b411128da157f5f64092128c212c0000973bfecd12b3e94b3d648495 listening on port 8000
(ProxyActor pid=1450530) INFO: Started server process [1450530]
(ServeController pid=1450442) INFO 2024-01-11 12:58:42,180 controller 1450442 application_state.py:477 - Built application 'demo_app' successfully.
(ServeController pid=1450442) INFO 2024-01-11 12:58:42,182 controller 1450442 deployment_state.py:1379 - Deploying new version of deployment VLLMInference in application 'demo_app'.
(ServeController pid=1450442) INFO 2024-01-11 12:58:42,284 controller 1450442 deployment_state.py:1668 - Adding 1 replica to deployment VLLMInference in application 'demo_app'.
(ServeController pid=1450442) INFO 2024-01-11 12:58:42,302 controller 1450442 application_state.py:477 - Built application 'demo_app2' successfully.
(ServeController pid=1450442) INFO 2024-01-11 12:58:42,304 controller 1450442 deployment_state.py:1379 - Deploying new version of deployment VLLMInference in application 'demo_app2'.
(ServeController pid=1450442) INFO 2024-01-11 12:58:42,406 controller 1450442 deployment_state.py:1668 - Adding 1 replica to deployment VLLMInference in application 'demo_app2'.
(ServeReplica:demo_app:VLLMInference pid=1468450) INFO 2024-01-11 12:58:45,015 VLLMInference demo_app#VLLMInference#WArOfC vllm_serve.py:76 - AsyncEngineArgs(model='facebook/opt-13b', tokenizer='facebook/opt-13b', tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', seed=0, max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, block_size=16, swap_space=4, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, revision=None, tokenizer_revision=None, quantization=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
(ServeReplica:demo_app2:VLLMInference pid=1468458) INFO 2024-01-11 12:58:45,021 VLLMInference demo_app2#VLLMInference#xOjgzS vllm_serve.py:76 - AsyncEngineArgs(model='facebook/opt-13b', tokenizer='facebook/opt-13b', tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', seed=0, max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, block_size=16, swap_space=4, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, revision=None, tokenizer_revision=None, quantization=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
(ServeReplica:demo_app:VLLMInference pid=1468450) SIGTERM handler is not set because current thread is not the main thread.
(ServeReplica:demo_app:VLLMInference pid=1468450) Calling ray.init() again after it has already been called.
(ServeController pid=1450442) WARNING 2024-01-11 12:59:12,292 controller 1450442 deployment_state.py:1996 - Deployment 'VLLMInference' in application 'demo_app' has 1 replicas that have taken more than 30s to initialize. This may be caused by a slow __init__ or reconfigure method.
(ServeReplica:demo_app2:VLLMInference pid=1468458) SIGTERM handler is not set because current thread is not the main thread.
(ServeReplica:demo_app2:VLLMInference pid=1468458) Calling ray.init() again after it has already been called.
(ServeController pid=1450442) WARNING 2024-01-11 12:59:12,494 controller 1450442 deployment_state.py:1996 - Deployment 'VLLMInference' in application 'demo_app2' has 1 replicas that have taken more than 30s to initialize. This may be caused by a slow __init__ or reconfigure method.
(ServeController pid=1450442) WARNING 2024-01-11 12:59:42,363 controller 1450442 deployment_state.py:1996 - Deployment 'VLLMInference' in application 'demo_app' has 1 replicas that have taken more than 30s to initialize. This may be caused by a slow __init__ or reconfigure method.
(ServeController pid=1450442) WARNING 2024-01-11 12:59:42,566 controller 1450442 deployment_state.py:1996 - Deployment 'VLLMInference' in application 'demo_app2' has 1 replicas that have taken more than 30s to initialize. This may be caused by a slow __init__ or reconfigure method.
(ServeController pid=1450442) WARNING 2024-01-11 13:00:12,441 controller 1450442 deployment_state.py:1996 - Deployment 'VLLMInference' in application 'demo_app' has 1 replicas that have taken more than 30s to initialize. This may be caused by a slow __init__ or reconfigure method.
(ServeController pid=1450442) WARNING 2024-01-11 13:00:12,645 controller 1450442 deployment_state.py:1996 - Deployment 'VLLMInference' in application 'demo_app2' has 1 replicas that have taken more than 30s to initialize. This may be caused by a slow __init__ or reconfigure method.
```
Interestingly, when I disable the demo_app2 application by commenting it out in the configuration, the deployment completes without any issues. I have 8 GPUs on my server, so that should be enough for the configuration above (two applications, each with a single replica that requests num_gpus: 4, i.e. 8 GPUs in total).
I've also tried writing my own deployment in Python, bypassing the ray_vllm_inference library, but I ran into the same problem. The vLLM application seems to be using the wrong GPUs: when I log the CUDA_VISIBLE_DEVICES variable in the initialization function it shows 0,1,2,3, but according to nvidia-smi vLLM is actually running on GPUs 4,5,6,7.
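For reference, here is a minimal sketch of that standalone Python deployment (the class name and request handling are illustrative rather than my exact code; the engine setup mirrors the AsyncEngineArgs values shown in the logs above):

```python
# Minimal sketch, not the exact code I ran.
import logging
import os

from ray import serve
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

logger = logging.getLogger("ray.serve")


@serve.deployment(num_replicas=1, ray_actor_options={"num_gpus": 4})
class VLLMInference:
    def __init__(self, model: str):
        # This is where I logged CUDA_VISIBLE_DEVICES: it prints "0,1,2,3",
        # while nvidia-smi shows the engine running on GPUs 4,5,6,7.
        logger.info("CUDA_VISIBLE_DEVICES=%s", os.environ.get("CUDA_VISIBLE_DEVICES"))
        engine_args = AsyncEngineArgs(model=model, tensor_parallel_size=4)
        self.engine = AsyncLLMEngine.from_engine_args(engine_args)

    async def __call__(self, request):
        # Generation logic omitted; the hang happens before any request arrives.
        ...


deployment = VLLMInference.bind(model="facebook/opt-13b")
```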
To troubleshoot, I created a custom deployment using the SDXL model instead (also two applications). That worked perfectly, with the models running on exactly the GPUs specified in CUDA_VISIBLE_DEVICES.
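The SDXL control deployment looked roughly like this (a sketch only: the checkpoint name and num_gpus value are assumptions, not necessarily what I used), and here the CUDA_VISIBLE_DEVICES logged in __init__ matched nvidia-smi:

```python
# Rough sketch of the SDXL control deployment; checkpoint and resource
# values are assumptions.
import logging
import os

import torch
from diffusers import DiffusionPipeline
from ray import serve

logger = logging.getLogger("ray.serve")


@serve.deployment(num_replicas=1, ray_actor_options={"num_gpus": 1})
class SDXLInference:
    def __init__(self):
        # For this deployment the logged value matched nvidia-smi.
        logger.info("CUDA_VISIBLE_DEVICES=%s", os.environ.get("CUDA_VISIBLE_DEVICES"))
        self.pipe = DiffusionPipeline.from_pretrained(
            "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
        ).to("cuda")

    async def __call__(self, request):
        # Image generation omitted in this sketch; the point is only the
        # GPU placement check in __init__.
        ...


deployment = SDXLInference.bind()
```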