vllm hangs when reinitializing ray #1058

nelson-liu · 2023-09-15T19:03:58Z

I'd like to be able to unload a vllm model and re-load it later, in the same script. However, the following (on 0.1.7) causes the script to hang (disclaimer: this isn't my particular workload, but a minimal reproducible example):

from vllm import LLM, SamplingParams

def process_prompts(prompts):
    llm = LLM(
        model="meta-llama/Llama-2-70b-chat-hf",
        tensor_parallel_size=2,
        trust_remote_code=True,
        load_format="pt")
    sampling_params = SamplingParams(temperature=0.0, top_p=1.0, max_tokens=500)
    return llm.generate(prompts, sampling_params)

prompt_batch_1 = ["Hello, my name is", "The president of the United States is"]
prompt_batch_2 = ["The capital of France is", "The future of AI is"]

batch_1_output = process_prompts(prompt_batch_1)
batch_2_output = process_prompts(prompt_batch_2)

Results in:

2023-09-15 11:43:25,943 INFO worker.py:1621 -- Started a local Ray instance.
INFO 09-15 11:43:51 llm_engine.py:72] Initializing an LLM engine with config: model='meta-llama/Llama-2
-70b-chat-hf', tokenizer='meta-llama/Llama-2-70b-chat-hf', tokenizer_mode=auto, trust_remote_code=True,
 dtype=torch.float16, download_dir='/scr/biggest/nfliu/cache/huggingface/', load_format=pt, tensor_para
llel_size=2, seed=0)
INFO 09-15 11:43:51 tokenizer.py:30] For some LLaMA-based models, initializing the fast tokenizer may t
ake a long time. To eliminate the initialization time, consider using 'hf-internal-testing/llama-tokeni
zer' instead of the original tokenizer.
INFO 09-15 11:45:58 llm_engine.py:199] # GPU blocks: 2561, # CPU blocks: 1638
Processed prompts: 100%|█████████████████████████████████████████████████| 2/2 [00:14<00:00,  7.17s/it]
2023-09-15 11:46:28,348 INFO worker.py:1453 -- Calling ray.init() again after it has already been called.

Then, it just hangs forever (been waiting 10 minutes, with no sign of life). Checking the GPUs shows that the model is indeed unloaded from the GPUs.

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  | 00000000:C7:00.0 Off |                    0 |
| N/A   30C    P0              61W / 350W |      4MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  | 00000000:CA:00.0 Off |                    0 |
| N/A   31C    P0              57W / 350W |      4MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

I'm fairly sure this is related to ray, since this doesn't happen if tensor parallelism is set to 1 (e.g., if you're running a smaller model). When I ctrl+c out of the script after it hangs, it shows that it's stuck on ray.get(current_placement_group.ready(), timeout=1800) https://github.com/vllm-project/vllm/blob/main/vllm/engine/ray_utils.py#L112C9-L112C63 .

Is there any way to "reset" the ray state, such that it initializes from scratch the second time?

The text was updated successfully, but these errors were encountered:

hsm1997 · 2023-09-16T07:13:10Z

maybe you can try insert os.system("ray stop --force") somewhere between unload and reload

raihan0824 · 2023-09-19T11:41:32Z

same problem, any solution?

Fenkail · 2023-10-18T09:50:25Z

I encountered the same issue. It runs fine when I use tensor_parallel_size=1, but it hangs when I use tensor_parallel_size>1 . I have tried reinstalling many times but it didn't help.

The final solution for me was to modify the vllm/engine/ray_utils.py file and limit the number of CPUs used. After making this change, it works properly. The modified code is:
ray.init(num_cpus=32, num_gpus=4, address=ray_address, ignore_reinit_error=True).

Note: I encountered hanging issues while using tensor_parallel_size>1 on a 128-core machine. However, running tensor_parallel_size>1 on a 96-core machine works normally

yichenjm · 2023-11-02T06:52:22Z

@Fenkail Hi, may I ask how do you decide the number of CPUs limit? I am running exactly the same issue as OP

pvtoan · 2023-11-07T07:23:09Z

Hi @Fenkail , I already modified the "ray_utils.py" as you suggested but the problem is still there.

In fact, my pc has only two GPUs. So, I'd like to know how you choose num_cpus and num_gpus to fix the problem?

Fenkail · 2023-11-14T09:30:17Z

@Fenkail Hi, may I ask how do you decide the number of CPUs limit? I am running exactly the same issue as OP

I just tried using 32 cores and it solved my problem. The specific number of CPU cores can be adjusted according to your needs. It was working fine on a machine with 96 cores, but I encountered issues on a 128-core machine, so I thought of limiting the CPU usage.

Fenkail · 2023-11-14T09:42:22Z

ray_utils

Did you modify the ray_utils.py installed in the conda environment for vllm?

pvtoan · 2023-11-14T09:46:06Z

Yes, I did modify ray_utils.py, installed in my conda environment for vllm

qizzzh · 2023-12-08T05:03:36Z

Hit the exact same issue when running vLLM in Ray serve.

qizzzh · 2023-12-08T06:12:52Z

In my case I have 4 GPUs and 3 RayServe deployments, 2 of which require 1 logical GPU with tensor_parallelism=1 and another one which requires 2 logical GPUs with tensor_parallelism=2. Looks like when vLLM tries to handle the tensor_parallelism=2 it got stuck because of not enough resources.

Resources
---------------------------------------------------------------
Usage:
 17.0/48.0 CPU
 4.0/4.0 GPU
 0B/104.83GiB memory
 44B/48.92GiB object_store_memory

Demands:
 {'GPU': 1.0} * 2 (PACK): 1+ pending placement groups

smallmocha · 2023-12-20T12:04:46Z

you should load model outside the function to keep model only load once

from vllm import LLM, SamplingParams

llm = LLM(
model="meta-llama/Llama-2-70b-chat-hf",
tensor_parallel_size=2,
trust_remote_code=True,
load_format="pt")

def process_prompts(prompts):
sampling_params = SamplingParams(temperature=0.0, top_p=1.0, max_tokens=500)
return llm.generate(prompts, sampling_params)

prompt_batch_1 = ["Hello, my name is", "The president of the United States is"]
prompt_batch_2 = ["The capital of France is", "The future of AI is"]

batch_1_output = process_prompts(prompt_batch_1)
batch_2_output = process_prompts(prompt_batch_2)

Dolfik1 · 2024-01-11T11:12:24Z

TLDR: Don't set the num_gpus value for vLLM, only set tensor_parallel_size.

I encountered the same problem, and here's what I found out:

According to Ray's documentation, the framework itself will allocate the necessary GPUs (based on num_gpus) and set the CUDA_VISIBLE_DEVICES value. When I started running vLLM through Ray, I found that Ray sets the CUDA_VISIBLE_DEVICES value to 0,1,2,3 (I had num_gpus = 4 specified), however, when I called nvidia-smi, I found that vLLM uses 4,5,6,7. Therefore, vLLM ignores the CUDA_VISIBLE_DEVICES value and chooses other devices.
In my case, I have 8 GPUs, I allocated 4 for vLLM, and 2 for another model, leaving 2 free. But vLLM requested another 4 GPUs, and since Ray couldn't satisfy this request, vLLM started waiting for the GPUs to free up. As soon as I removed the second model, which requested 2 GPUs, everything started working. Everything also worked when I allocated 2 GPUs for the vLLM model.
If you try to run two identical applications using vLLM through Ray (in one serve), everything will break. The applications will not use different GPUs, but will start loading data into the memory of the same GPUs, while the other GPUs will be idle. Ultimately, this will lead to OOM. I believe this is related to the incorrect handling of the num_gpus value.

hwaking · 2024-01-16T01:20:26Z

I just tried using 32 cores and it solved my problem.

paolovic · 2024-03-11T21:36:52Z

I am still having the problem
I want to deploy one model with tensor_parallel_size=2 (just 1 replica), one model with num_gpus=0.4 (with 2 replicas, so in total 0.8 GPUs), and one model with num_gpus=0.1 (with 1 replica)
In total, this would require 2.9 GPUs which is ok since I have 3 GPUs each with sufficient VRAM on this node alone at hand.

ray status returns

Resources
---------------------------------------------------------------
Usage:
 10.0/24.0 CPU
 0.8999999999999999/4.0 GPU
 0B/200.20GiB memory
 44B/89.79GiB object_store_memory

Demands:
 {'CPU': 12.0}: 1+ pending tasks/actors

serve status returns

        message: 'Deployment ''vllmAPI'' in application ''ray vllm application''
          1 replicas that have taken more than 30s to be scheduled. This may be due
          to waiting for the cluster to auto-scale or for a runtime environment to
          be installed. Resources required for each replica: {"CPU": 12.0}, total
          resources available: {"CPU": 14.0}. Use `ray status` for more details.'

Edit: Problem solved.
As adviced above, for the one model with tensor_parallel_size=2, I defined num_gpus=0
For the others, as written above, one model with num_gpus=0.4 (with 2 replicas, so in total 0.8 GPUs), and one model with num_gpus=0.1 (with 1 replica) BUT at the same time, I also defined for these others CUDA_VISIBLE_DEVICES=0,1,2,3 (since I have 4 GPUs) and then it was able to spin up properly

premsa · 2024-03-14T13:06:22Z

Fenkail's solution for setting the 'num_cpus' parameter up to a correct amount (i.e. 10 out of 10 available in my case) solved my problem. A fix for slurm jobs:

num_cpus = int(os.environ.get('SLURM_CPUS_PER_TASK'))

panxnan · 2024-04-11T09:56:52Z

I also fix the problem by setting the ray num cpus to 32.
ray start --head --num-cpus=32
It also works when I set cpus to 49 (since I have two pysical cpus, each have 48 cores)

shyringo · 2024-04-24T07:55:12Z

#1908 might be related, but in 'Offline Batched Inference' mode.

Vincent-Li-9701 · 2024-05-20T22:15:57Z

Hey folks had a similar issue, I'm running with offline inference mode. I was able to clear the resource with ray stop But when I try to reload the resource I got
[2024-05-20 22:14:06,214 E 1068826 1069198] gcs_rpc_client.h:554: Failed to connect to GCS within 60 seconds. GCS may have been killed. It's either GCS is terminated by `ray stop` or is killed unexpectedly. If it is killed unexpectedly, see the log file gcs_server.out. https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#logging-directory-structure. The program will terminate. Does anyone know how to properly restart?

emirhanKural · 2024-06-05T13:45:51Z

Hi @DarkLight1337, is there any update for the bug ? I have also the same problem when reload a model in api infrence.

Firstly, when I run api code, everything is fine, loading is ok.

If I try directly reload a model, I get:

2024-06-05 16:32:47,026 WARNING worker.py:1419 -- SIGTERM handler is not set because current thread is not the main thread.
2024-06-05 16:32:47,035 INFO worker.py:1564 -- Connecting to existing Ray cluster at address: 10.187.##.##:##...
2024-06-05 16:32:47,035 INFO worker.py:1582 -- Calling ray.init() again after it has already been called.`

And nothing happens.

If I check ray status and shutdown the ray cluster and reload a model, I get:

if ray.is_initialized():
    ray.shutdown()

new_model = AsyncLLMEngine.from_engine_args(engine_args, usage_context=UsageContext.API_SERVER)

2024-06-05 16:39:43,757 WARNING worker.py:1419 -- SIGTERM handler is not set because current thread is not the main thread.
2024-06-05 16:39:43,766 INFO worker.py:1564 -- Connecting to existing Ray cluster at address: 10.187.##.##:##...
2024-06-05 16:39:43,766 INFO worker.py:1582 -- Calling ray.init() again after it has already been called.
INFO 06-05 16:39:44 llm_engine.py:161] Initializing an LLM engine (v0.4.3) with config: .......

It seems it connects and loads the model again but it does not load and gets this error:

(RayWorkerWrapper pid=3562842) [E socket.cpp:957] [c10d] The client socket has timed out after 600s while trying to connect to (10.187.##.##:##..., 44163)

DarkLight1337 · 2024-06-05T14:04:09Z

I was just triaging the issues. I'm not that involved with the use of Ray in vLLM so I won't be of much assistance here.

DarkLight1337 · 2024-06-13T09:04:29Z

We have added documentation for this situation in #5430. Please take a look.

qizzzh mentioned this issue Dec 8, 2023

Tensor parallelism on ray cluster #1566

Closed

nylocx mentioned this issue Dec 12, 2023

GPU has fallen off the bus running in docker with more than one GPU h2oai/h2ogpt#1195

Closed

jieguangzhou mentioned this issue Jan 5, 2024

Remove vllm dependency when using ray to run vllm superduper-io/superduper#1637

Merged

5 tasks

Dolfik1 mentioned this issue Jan 11, 2024

ray serve get stuck when loading two or more applications asprenger/ray_vllm_inference#3

Open

ToheartZhang mentioned this issue Mar 18, 2024

How to run multiple mixtral in one machine #3406

Closed

saattrupdan mentioned this issue Apr 4, 2024

[BUG] ray.exceptions.GetTimeoutError when providing multiple languages as input ScandEval/ScandEval#283

Closed

premsa added a commit to premsa/vllm that referenced this issue Apr 4, 2024

Update ray_utils.py FIX vllm-project#1058

85ca041

saattrupdan mentioned this issue Apr 21, 2024

Fix/283 ray timeout ScandEval/ScandEval#410

Merged

rkooo567 self-assigned this May 3, 2024

DarkLight1337 added the bug Something isn't working label May 31, 2024

DarkLight1337 closed this as completed Jun 13, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

vllm hangs when reinitializing ray #1058

vllm hangs when reinitializing ray #1058

nelson-liu commented Sep 15, 2023 •

edited

Loading

hsm1997 commented Sep 16, 2023 •

edited

Loading

raihan0824 commented Sep 19, 2023

Fenkail commented Oct 18, 2023 •

edited

Loading

yichenjm commented Nov 2, 2023 •

edited

Loading

pvtoan commented Nov 7, 2023

Fenkail commented Nov 14, 2023

Fenkail commented Nov 14, 2023

pvtoan commented Nov 14, 2023 •

edited

Loading

qizzzh commented Dec 8, 2023 •

edited

Loading

qizzzh commented Dec 8, 2023

smallmocha commented Dec 20, 2023

Dolfik1 commented Jan 11, 2024

hwaking commented Jan 16, 2024

paolovic commented Mar 11, 2024 •

edited

Loading

premsa commented Mar 14, 2024 •

edited

Loading

panxnan commented Apr 11, 2024

shyringo commented Apr 24, 2024

Vincent-Li-9701 commented May 20, 2024

emirhanKural commented Jun 5, 2024 •

edited

Loading

DarkLight1337 commented Jun 5, 2024

DarkLight1337 commented Jun 13, 2024

vllm hangs when reinitializing ray #1058

vllm hangs when reinitializing ray #1058

Comments

nelson-liu commented Sep 15, 2023 • edited Loading

hsm1997 commented Sep 16, 2023 • edited Loading

raihan0824 commented Sep 19, 2023

Fenkail commented Oct 18, 2023 • edited Loading

yichenjm commented Nov 2, 2023 • edited Loading

pvtoan commented Nov 7, 2023

Fenkail commented Nov 14, 2023

Fenkail commented Nov 14, 2023

pvtoan commented Nov 14, 2023 • edited Loading

qizzzh commented Dec 8, 2023 • edited Loading

qizzzh commented Dec 8, 2023

smallmocha commented Dec 20, 2023

Dolfik1 commented Jan 11, 2024

hwaking commented Jan 16, 2024

paolovic commented Mar 11, 2024 • edited Loading

premsa commented Mar 14, 2024 • edited Loading

panxnan commented Apr 11, 2024

shyringo commented Apr 24, 2024

Vincent-Li-9701 commented May 20, 2024

emirhanKural commented Jun 5, 2024 • edited Loading

DarkLight1337 commented Jun 5, 2024

DarkLight1337 commented Jun 13, 2024

nelson-liu commented Sep 15, 2023 •

edited

Loading

hsm1997 commented Sep 16, 2023 •

edited

Loading

Fenkail commented Oct 18, 2023 •

edited

Loading

yichenjm commented Nov 2, 2023 •

edited

Loading

pvtoan commented Nov 14, 2023 •

edited

Loading

qizzzh commented Dec 8, 2023 •

edited

Loading

paolovic commented Mar 11, 2024 •

edited

Loading

premsa commented Mar 14, 2024 •

edited

Loading

emirhanKural commented Jun 5, 2024 •

edited

Loading