vllm hangs when reinitializing ray #1058
Maybe you can try inserting
Same problem, any solution?
I encountered the same issue: it ran fine in some setups but hung in others. The final solution for me was to modify ray_utils.py and limit the number of CPUs passed to ray.init().
@Fenkail Hi, may I ask how you decided on the CPU limit? I am running into exactly the same issue as the OP.
Hi @Fenkail, I already modified ray_utils.py as you suggested but the problem is still there. My machine has only two GPUs, so I'd like to know how you chose num_cpus and num_gpus to fix the problem.
I just tried using 32 cores and it solved my problem. The specific number of CPU cores can be adjusted to your needs. It was working fine on a machine with 96 cores, but I hit the issue on a 128-core machine, so I tried limiting the CPU usage.
Did you modify the ray_utils.py installed in the conda environment for vllm?
Yes, I did modify ray_utils.py, installed in my conda environment for vllm.
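For reference, a minimal sketch of that kind of change, assuming a vLLM release whose vllm/engine/ray_utils.py starts Ray via ray.init(); the function name and signature here are approximations, not the exact upstream code:

```python
import ray

# Approximate shape of vLLM's cluster initialization; the real signature
# differs between releases. The idea is simply to cap the logical CPUs
# that Ray registers instead of letting it claim every core.
def initialize_cluster(ray_address=None):
    if ray_address is None:
        # 32 worked for the commenters above; pick a value <= your core count.
        ray.init(num_cpus=32, ignore_reinit_error=True)
    else:
        # Ray rejects num_cpus when attaching to an existing cluster,
        # so the cap must be applied when that cluster is started instead.
        ray.init(address=ray_address, ignore_reinit_error=True)
```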
Hit the exact same issue when running vLLM in Ray Serve.
In my case I have 4 GPUs and 3 Ray Serve deployments: two that each require 1 logical GPU with tensor_parallelism=1, and one that requires 2 logical GPUs with tensor_parallelism=2. It looks like vLLM gets stuck handling the tensor_parallelism=2 deployment because there are not enough resources left.
You should load the model outside the function so that it is only loaded once: import LLM and SamplingParams from vllm, construct the LLM object once at module scope, and let process_prompts(prompts) reuse it for every batch (e.g. prompt_batch_1 = ["Hello, my name is", "The president of the United States is"], then batch_1_output = process_prompts(prompt_batch_1)).
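A reconstruction of that pattern; the model name, tensor_parallel_size, and sampling parameters are placeholders rather than values from the original comment:

```python
from vllm import LLM, SamplingParams

# Build the engine once at module scope so Ray/vLLM only initializes a single time.
llm = LLM(
    model="facebook/opt-125m",   # placeholder model name
    tensor_parallel_size=1,
)
sampling_params = SamplingParams(temperature=0.8, max_tokens=64)

def process_prompts(prompts):
    # Reuse the already-loaded engine for every batch.
    return llm.generate(prompts, sampling_params)

prompt_batch_1 = ["Hello, my name is", "The president of the United States is"]
batch_1_output = process_prompts(prompt_batch_1)
```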
> I just tried using 32 cores and it solved my problem.
I am still having the problem
Resources
---------------------------------------------------------------
Usage:
10.0/24.0 CPU
0.8999999999999999/4.0 GPU
0B/200.20GiB memory
44B/89.79GiB object_store_memory
Demands:
{'CPU': 12.0}: 1+ pending tasks/actors
message: 'Deployment ''vllmAPI'' in application ''ray vllm application'' has
1 replicas that have taken more than 30s to be scheduled. This may be due
to waiting for the cluster to auto-scale or for a runtime environment to
be installed. Resources required for each replica: {"CPU": 12.0}, total
resources available: {"CPU": 14.0}. Use `ray status` for more details.'

Edit: Problem solved.
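For anyone hitting the same mismatch between requested and available resources, one way to check it is to query Ray directly; a sketch, assuming a Ray cluster is already running on the machine (`ray status` on the command line reports the same information):

```python
import ray

# Attach to the running cluster and compare its total resources with what is free.
ray.init(address="auto", ignore_reinit_error=True)
print("cluster:  ", ray.cluster_resources())
print("available:", ray.available_resources())
```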
Fenkail's solution of setting the 'num_cpus' parameter to the correct amount (i.e. 10 out of the 10 available in my case) solved my problem. A fix for Slurm jobs:
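The original Slurm snippet was not preserved above; one hypothetical approach is to derive Ray's CPU cap from the allocation Slurm exposes through the SLURM_CPUS_PER_TASK environment variable:

```python
import os
import ray

# Hypothetical sketch: cap Ray at the CPUs Slurm actually allocated to this job.
# SLURM_CPUS_PER_TASK is only set when the job requests --cpus-per-task.
slurm_cpus = os.environ.get("SLURM_CPUS_PER_TASK")
num_cpus = int(slurm_cpus) if slurm_cpus else None  # None lets Ray autodetect

ray.init(num_cpus=num_cpus, ignore_reinit_error=True)
```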
I also fixed the problem by setting the Ray num_cpus to 32.
#1908 might be related, but in 'Offline Batched Inference' mode.
Hey folks, I had a similar issue; I'm running in offline inference mode. I was able to clear the resources with
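The exact call was cut off above; a common cleanup sequence for offline inference that releases the GPU and the local Ray cluster between runs, sketched here with `llm` standing for the vllm.LLM instance created earlier (an illustration, not necessarily what the commenter used):

```python
import gc
import ray
import torch

# Drop the engine, reclaim GPU memory, and tear down the local Ray cluster
# so the next LLM(...) construction starts from a clean state.
del llm               # llm is the vllm.LLM instance created earlier
gc.collect()
torch.cuda.empty_cache()
ray.shutdown()
```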
Hi @DarkLight1337, is there any update on this bug? I have the same problem when reloading a model for API inference. When I first run the API code, everything is fine and the model loads. If I try to reload a model directly, I get:
And nothing happens. If I check `ray status`, shut down the Ray cluster, and reload the model, I get:
It seems to connect and start loading the model again, but the load never completes and fails with this error:
I was just triaging the issues. I'm not that involved with the use of Ray in vLLM, so I won't be of much assistance here.
We have added documentation for this situation in #5430. Please take a look.
I'd like to be able to unload a vllm model and re-load it later, in the same script. However, the following (on 0.1.7) causes the script to hang (disclaimer: this isn't my particular workload, but a minimal reproducible example):
Results in:
Then, it just hangs forever (been waiting 10 minutes, with no sign of life). Checking the GPUs shows that the model is indeed unloaded from the GPUs.
I'm fairly sure this is related to ray, since this doesn't happen if tensor parallelism is set to 1 (e.g., if you're running a smaller model). When I ctrl+c out of the script after it hangs, it shows that it's stuck on
ray.get(current_placement_group.ready(), timeout=1800)
https://github.com/vllm-project/vllm/blob/main/vllm/engine/ray_utils.py#L112C9-L112C63. Is there any way to "reset" the ray state, such that it initializes from scratch the second time?
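A minimal sketch of such a reset, assuming a script-local Ray cluster and a placeholder model name; note that on affected vLLM versions the second construction may still hang, which is exactly the bug tracked here:

```python
import gc
import ray
import torch
from vllm import LLM

def load():
    # Placeholder model; any model run with tensor_parallel_size=2 makes vLLM use Ray.
    return LLM(model="facebook/opt-6.7b", tensor_parallel_size=2)

llm = load()
# ... run generation with llm ...

# Unload the model, then reset Ray so the second construction starts from scratch.
del llm
gc.collect()
torch.cuda.empty_cache()
ray.shutdown()

llm = load()
```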