Multi GPU ROCm6 issues, and workarounds #2794

Closed
BKitor opened this issue Feb 6, 2024 · 8 comments
@BKitor
Contributor

BKitor commented Feb 6, 2024

I ran into a series of issues trying to get vLLM stood up on a system with multiple MI210s. I figured I'd document the issues and workarounds so that someone can pick up the baton later, or at least save themselves some debugging time.

  1. Ray will deadlock with multiple AMD GPUs. Ray doesn't officially support AMD GPUs in v2.9; I updated Ray to nightlies (v3.0).
pip uninstall ray
pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp310-cp310-manylinux2014_x86_64.whl"
  2. Something might have changed with how Ray exposes GPUs to workers: only one GPU was exposed to each worker, so torch.cuda.set_device() with anything other than 0 would fail. I tweaked worker.py to always use device 0 (diff below; see the diagnostic sketch after it), but I don't think this is a viable long-term fix.
diff --git a/vllm/worker/worker.py b/vllm/worker/worker.py
index c97e82a..a63fbd9 100644
--- a/vllm/worker/worker.py
+++ b/vllm/worker/worker.py
@@ -68,6 +68,7 @@ class Worker:
         self.gpu_cache = None

     def init_model(self) -> None:
+        print(f"***** local_rank {self.local_rank} hit init_model, is_driver: {self.is_driver_worker} *****")
         if self.device_config.device.type == "cuda":
             # torch.distributed.all_reduce does not free the input tensor until
             # the synchronization point. This causes the memory usage to grow
@@ -80,7 +81,9 @@ class Worker:
             # This env var set by Ray causes exceptions with graph building.
             os.environ.pop("NCCL_ASYNC_ERROR_HANDLING", None)
             self.device = torch.device(f"cuda:{self.local_rank}")
-            torch.cuda.set_device(self.device)
+            print(f"***** trying to set dev {self.device} of {torch.cuda.device_count()} is_driver: {self.is_driver_worker} *****")
+            # torch.cuda.set_device(self.device)
+            torch.cuda.set_device(0)

             _check_if_gpu_supports_dtype(self.model_config.dtype)
         else:
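For anyone retracing this, here is a minimal diagnostic sketch (not part of the original report; it assumes only that ray and a ROCm build of torch are installed) that prints what each Ray worker actually sees:

import os
import ray
import torch

ray.init()

@ray.remote(num_gpus=1)
def report_visible_devices():
    # Ray limits each worker's GPU visibility via the *_VISIBLE_DEVICES env vars,
    # so torch.cuda.device_count() inside a worker may report 1 even on an 8-GPU node.
    return {
        "ray_gpu_ids": ray.get_gpu_ids(),
        "HIP_VISIBLE_DEVICES": os.environ.get("HIP_VISIBLE_DEVICES"),
        "ROCR_VISIBLE_DEVICES": os.environ.get("ROCR_VISIBLE_DEVICES"),
        "torch_device_count": torch.cuda.device_count(),
    }

# If every worker reports torch_device_count == 1, then
# torch.cuda.set_device(local_rank) with local_rank > 0 will fail, which is
# exactly what the worker.py tweak above papers over.
print(ray.get([report_visible_devices.remote() for _ in range(2)]))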
@SuperBruceJia

@BKitor Have you found any solution for distributed inference? Thank you very much in advance!

Best regards,

Shuyue
June 9th, 2024

@BKitor
Contributor Author

BKitor commented Jun 10, 2024

Sorry, I haven't poked at this in a while (lost access to the multi-node system).
But for single-node multi-GPU inference, the 'mp' distributed_executor_backend has been fairly stable.

@SuperBruceJia

SuperBruceJia commented Jun 10, 2024

@BKitor Benjamin, I am using a single node with multiple GPUs, but there is a problem regarding init_device (https://github.com/vllm-project/vllm/blob/main/vllm/worker/worker.py#L92-L118).

Do you have any idea how to solve it?

Thank you very much, and have a nice day!

2024-06-10 16:36:06,142 INFO worker.py:1568 -- Connecting to existing Ray cluster at address: 192.168.19.245:6379...
2024-06-10 16:36:06,142 INFO worker.py:1586 -- Calling ray.init() again after it has already been called.
INFO 06-10 16:36:06 llm_engine.py:161] Initializing an LLM engine (v0.4.3) with config: model='./save_folder', speculative_config=None, tokenizer='meta-llama/Meta-Llama-3-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=./save_folder)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
(pid=487121) /usr4/ec523/brucejia/.local/lib/python3.10/site-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
(pid=487121)   warnings.warn(
(RayWorkerWrapper pid=487232) ERROR 06-10 16:36:10 worker_base.py:149] Error executing method init_device. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=487232) ERROR 06-10 16:36:10 worker_base.py:149] Traceback (most recent call last):
(RayWorkerWrapper pid=487232) ERROR 06-10 16:36:10 worker_base.py:149]   File "/usr4/ec523/brucejia/.local/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 141, in execute_method
(RayWorkerWrapper pid=487232) ERROR 06-10 16:36:10 worker_base.py:149]     return executor(*args, **kwargs)
(RayWorkerWrapper pid=487232) ERROR 06-10 16:36:10 worker_base.py:149]   File "/usr4/ec523/brucejia/.local/lib/python3.10/site-packages/vllm/worker/worker.py", line 106, in init_device
(RayWorkerWrapper pid=487232) ERROR 06-10 16:36:10 worker_base.py:149]     torch.cuda.set_device(self.device)
(RayWorkerWrapper pid=487232) ERROR 06-10 16:36:10 worker_base.py:149]   File "/usr4/ec523/brucejia/.local/lib/python3.10/site-packages/torch/cuda/__init__.py", line 399, in set_device
(RayWorkerWrapper pid=487232) ERROR 06-10 16:36:10 worker_base.py:149]     torch._C._cuda_setDevice(device)
(RayWorkerWrapper pid=487232) ERROR 06-10 16:36:10 worker_base.py:149] RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
(RayWorkerWrapper pid=487232) ERROR 06-10 16:36:10 worker_base.py:149] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(RayWorkerWrapper pid=487232) ERROR 06-10 16:36:10 worker_base.py:149] For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
(RayWorkerWrapper pid=487232) ERROR 06-10 16:36:10 worker_base.py:149] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(RayWorkerWrapper pid=487232) ERROR 06-10 16:36:10 worker_base.py:149] 

Best regards,

Shuyue
June 10th, 2024

@BKitor
Contributor Author

BKitor commented Jun 10, 2024

What I'm suggesting is to not use Ray.
One of the arguments when instantiating a model is distributed_executor_backend, where the options include 'ray' or 'mp'.
I'm not sure how you're launching your model; you may need to pass distributed_executor_backend="mp" where you create the LLM, i.e. from vllm import LLM; llm = LLM(<whatever your args already are>, distributed_executor_backend="mp").
Otherwise, some of the provided helper scripts let you specify --distributed-executor-backend on the command line, but this isn't universal, so YMMV.
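As a minimal sketch of that call, assuming vLLM 0.4.3 on a two-GPU node (the model name here is just a placeholder):

from vllm import LLM, SamplingParams

# 'mp' uses Python multiprocessing instead of Ray for tensor parallelism.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder; use your own model
    tensor_parallel_size=2,
    distributed_executor_backend="mp",
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(temperature=0, max_tokens=32))
print(outputs[0].outputs[0].text)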

@SuperBruceJia

@BKitor Benjamin, it seems that there is no distributed_executor_backend argument in LLM: https://github.com/vllm-project/vllm/blob/main/vllm/engine/llm_engine.py. However, there is one in the AsyncLLMEngine: https://github.com/vllm-project/vllm/blob/main/vllm/engine/async_llm_engine.py, which is for serving.

May I know which vLLM version you are using?

Thank you very much, and have a nice day!

Best regards,

Shuyue
June 10th, 2024

@BKitor
Contributor Author

BKitor commented Jun 10, 2024 via email

The file you're looking for is arg_utils.py, and it's present in 0.4.3:

distributed_executor_backend: Optional[str] = None

@SuperBruceJia

Thank you very much, Benjamin! It really helps.

Now my multi-GPU inference runs smoothly. For other researchers' reference:

I use vLLM 0.4.3: https://github.com/vllm-project/vllm/releases/tag/v0.4.3

llm = LLM(
    model=save_dir,
    tokenizer=model_name,
    dtype='bfloat16',
    distributed_executor_backend="mp",
    tensor_parallel_size=num_gpus_vllm,
    gpu_memory_utilization=gpu_utilization_vllm,
    enable_lora=False,
)

sampling_params = SamplingParams(
    temperature=0,
    top_p=1,
    max_tokens=max_new_tokens,
    stop=stop_tokens
)

completions = llm.generate(
    prompts,
    sampling_params,
)

@BKitor However, the GPU memory cannot be released, except on the first-initialized GPU (cuda:0 in my case).

import gc

import torch
from vllm.distributed.parallel_state import destroy_model_parallel

# Delete the llm object and free the memory
destroy_model_parallel()
del llm.llm_engine.model_executor.driver_worker
del llm
gc.collect()
torch.cuda.empty_cache()
print("Successfully delete the llm pipeline and free the GPU memory.")

Do you have suggestions on releasing all the GPUs' memory?
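One workaround that does reclaim memory on every rank, sketched here only as a suggestion (the run_generation helper and the model path are illustrative, not something from this thread): run the LLM in a child process, so every device's memory is returned to the OS when that process exits.

import multiprocessing as mp

def run_generation(prompts, result_queue):
    # Hypothetical helper: build the LLM, generate, and return plain strings.
    # All GPU memory, on every tensor-parallel rank, is freed when this
    # child process exits.
    from vllm import LLM, SamplingParams
    llm = LLM(
        model="./save_folder",  # placeholder path, matching the log above
        tensor_parallel_size=2,
        distributed_executor_backend="mp",
    )
    outputs = llm.generate(prompts, SamplingParams(temperature=0, max_tokens=128))
    result_queue.put([o.outputs[0].text for o in outputs])

if __name__ == "__main__":
    ctx = mp.get_context("spawn")  # spawn avoids inheriting HIP/CUDA state
    queue = ctx.Queue()
    proc = ctx.Process(target=run_generation, args=(["Hello"], queue))
    proc.start()
    completions = queue.get()  # read before join to avoid blocking on a full queue
    proc.join()                # all GPU memory is released once the child exits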

Thank you very much, and have a nice day!

Best regards,

Shuyue
June 11th, 2024

@hongxiayang
Collaborator

This issue should be closed, as the current main branch supports multi-GPU on ROCm 6.1.x.
