Multi GPU ROCm6 issues, and workarounds #2794

Closed
BKitor opened this issue Feb 6, 2024 · 8 comments
@BKitor
Contributor

BKitor commented Feb 6, 2024

I ran into a series of issues trying to get vLLM stood up on a system with multiple MI210s. I figured I'd document the issues and workarounds so that someone can pick up the baton later, or at least save themselves some debugging time.

  1. Ray will deadlock with multiple AMD GPUs. Ray doesn't officially support AMD GPUs in v2.9; I updated Ray to nightlies (v3.0).
pip uninstall ray
pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp310-cp310-manylinux2014_x86_64.whl"
  2. Something might have changed with how Ray exposes GPUs to workers: only one GPU was exposed to each worker, so torch.cuda.set_device() with anything other than 0 would fail. I tweaked worker.py to always use device 0 (diff below; see the diagnostic sketch after it), but I don't think this is a viable long-term fix.
diff --git a/vllm/worker/worker.py b/vllm/worker/worker.py
index c97e82a..a63fbd9 100644
--- a/vllm/worker/worker.py
+++ b/vllm/worker/worker.py
@@ -68,6 +68,7 @@ class Worker:
         self.gpu_cache = None

     def init_model(self) -> None:
+        print(f"***** local_rank {self.local_rank} hit init_model, is_driver: {self.is_driver_worker} *****")
         if self.device_config.device.type == "cuda":
             # torch.distributed.all_reduce does not free the input tensor until
             # the synchronization point. This causes the memory usage to grow
@@ -80,7 +81,9 @@ class Worker:
             # This env var set by Ray causes exceptions with graph building.
             os.environ.pop("NCCL_ASYNC_ERROR_HANDLING", None)
             self.device = torch.device(f"cuda:{self.local_rank}")
-            torch.cuda.set_device(self.device)
+            print(f"***** trying to set dev {self.device} of {torch.cuda.device_count()} is_driver: {self.is_driver_worker} *****")
+            # torch.cuda.set_device(self.device)
+            torch.cuda.set_device(0)

             _check_if_gpu_supports_dtype(self.model_config.dtype)
         else:
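For anyone retracing this, here is a minimal diagnostic sketch (not part of the original report; it assumes only that ray and a ROCm build of torch are installed) that prints what each Ray worker actually sees:

import os
import ray
import torch

ray.init()

@ray.remote(num_gpus=1)
def report_visible_devices():
    # Ray limits each worker's GPU visibility via the *_VISIBLE_DEVICES env vars,
    # so torch.cuda.device_count() inside a worker may report 1 even on an 8-GPU node.
    return {
        "ray_gpu_ids": ray.get_gpu_ids(),
        "HIP_VISIBLE_DEVICES": os.environ.get("HIP_VISIBLE_DEVICES"),
        "ROCR_VISIBLE_DEVICES": os.environ.get("ROCR_VISIBLE_DEVICES"),
        "torch_device_count": torch.cuda.device_count(),
    }

# If every worker reports torch_device_count == 1, then
# torch.cuda.set_device(local_rank) with local_rank > 0 will fail, which is
# exactly what the worker.py tweak above papers over.
print(ray.get([report_visible_devices.remote() for _ in range(2)]))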
@SuperBruceJia

@BKitor Have you found any solution for distributed inference? Thank you very much in advance!

Best regards,

Shuyue
June 9th, 2024

@BKitor
Contributor Author

BKitor commented Jun 10, 2024

Sorry, I haven't poked at this in a while (lost access to the multi-node system).
But for single-node multi-GPU inference, the 'mp' distributed_executor_backend has been fairly stable.

@SuperBruceJia

SuperBruceJia commented Jun 10, 2024

@BKitor Benjamin, I am using a single node with multiple GPUs, but there is a problem regarding init_device (https://github.com/vllm-project/vllm/blob/main/vllm/worker/worker.py#L92-L118).

Do you have any idea how to solve it?

Thank you very much, and have a nice day!

2024-06-10 16:36:06,142 INFO worker.py:1568 -- Connecting to existing Ray cluster at address: 192.168.19.245:6379...
2024-06-10 16:36:06,142 INFO worker.py:1586 -- Calling ray.init() again after it has already been called.
INFO 06-10 16:36:06 llm_engine.py:161] Initializing an LLM engine (v0.4.3) with config: model='./save_folder', speculative_config=None, tokenizer='meta-llama/Meta-Llama-3-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=./save_folder)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
(pid=487121) /usr4/ec523/brucejia/.local/lib/python3.10/site-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
(pid=487121)   warnings.warn(
(RayWorkerWrapper pid=487232) ERROR 06-10 16:36:10 worker_base.py:149] Error executing method init_device. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=487232) ERROR 06-10 16:36:10 worker_base.py:149] Traceback (most recent call last):
(RayWorkerWrapper pid=487232) ERROR 06-10 16:36:10 worker_base.py:149]   File "/usr4/ec523/brucejia/.local/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 141, in execute_method
(RayWorkerWrapper pid=487232) ERROR 06-10 16:36:10 worker_base.py:149]     return executor(*args, **kwargs)
(RayWorkerWrapper pid=487232) ERROR 06-10 16:36:10 worker_base.py:149]   File "/usr4/ec523/brucejia/.local/lib/python3.10/site-packages/vllm/worker/worker.py", line 106, in init_device
(RayWorkerWrapper pid=487232) ERROR 06-10 16:36:10 worker_base.py:149]     torch.cuda.set_device(self.device)
(RayWorkerWrapper pid=487232) ERROR 06-10 16:36:10 worker_base.py:149]   File "/usr4/ec523/brucejia/.local/lib/python3.10/site-packages/torch/cuda/__init__.py", line 399, in set_device
(RayWorkerWrapper pid=487232) ERROR 06-10 16:36:10 worker_base.py:149]     torch._C._cuda_setDevice(device)
(RayWorkerWrapper pid=487232) ERROR 06-10 16:36:10 worker_base.py:149] RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
(RayWorkerWrapper pid=487232) ERROR 06-10 16:36:10 worker_base.py:149] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(RayWorkerWrapper pid=487232) ERROR 06-10 16:36:10 worker_base.py:149] For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
(RayWorkerWrapper pid=487232) ERROR 06-10 16:36:10 worker_base.py:149] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(RayWorkerWrapper pid=487232) ERROR 06-10 16:36:10 worker_base.py:149] 

Best regards,

Shuyue
June 10th, 2024

@BKitor
Contributor Author

BKitor commented Jun 10, 2024

What I'm suggesting is to not use Ray.
One of the arguments when instantiating a model is distributed_executor_backend, where the options include 'ray' or 'mp'.
I'm not sure how you're launching your model; you may need to pass distributed_executor_backend="mp" where you create the LLM, i.e. from vllm import LLM; llm = LLM(<whatever your args already are>, distributed_executor_backend="mp").
Otherwise, some of the provided helper scripts let you specify --distributed-executor-backend on the command line, but this isn't universal, so YMMV.
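As a minimal sketch of that call, assuming vLLM 0.4.3 on a two-GPU node (the model name here is just a placeholder):

from vllm import LLM, SamplingParams

# 'mp' uses Python multiprocessing instead of Ray for tensor parallelism.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder; use your own model
    tensor_parallel_size=2,
    distributed_executor_backend="mp",
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(temperature=0, max_tokens=32))
print(outputs[0].outputs[0].text)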

@SuperBruceJia

@BKitor Benjamin, it seems that there is no distributed_executor_backend argument in LLM: https://github.com/vllm-project/vllm/blob/main/vllm/engine/llm_engine.py. However, there is one in the AsyncLLMEngine: https://github.com/vllm-project/vllm/blob/main/vllm/engine/async_llm_engine.py, which is for serving.

May I know which vLLM version you are using?

Thank you very much, and have a nice day!

Best regards,

Shuyue
June 10th, 2024

@BKitor
Contributor Author

BKitor commented Jun 10, 2024 via email

The file you're looking for is arg_utils.py, and it's present in 0.4.3:

distributed_executor_backend: Optional[str] = None

@SuperBruceJia

Thank you very much, Benjamin! It really helps.

Now my multi-GPU inference runs smoothly. For other researchers' reference:

I use vLLM 0.4.3: https://github.com/vllm-project/vllm/releases/tag/v0.4.3

llm = LLM(
    model=save_dir,
    tokenizer=model_name,
    dtype='bfloat16',
    distributed_executor_backend="mp",
    tensor_parallel_size=num_gpus_vllm,
    gpu_memory_utilization=gpu_utilization_vllm,
    enable_lora=False,
)

sampling_params = SamplingParams(
    temperature=0,
    top_p=1,
    max_tokens=max_new_tokens,
    stop=stop_tokens
)

completions = llm.generate(
    prompts,
    sampling_params,
)

@BKitor However, the GPU memory cannot be released, except on the first-initialized GPU (cuda:0 in my case).

import gc

import torch
from vllm.distributed.parallel_state import destroy_model_parallel

# Delete the llm object and free the memory
destroy_model_parallel()
del llm.llm_engine.model_executor.driver_worker
del llm
gc.collect()
torch.cuda.empty_cache()
print("Successfully delete the llm pipeline and free the GPU memory.")

Do you have suggestions on releasing all the GPUs' memory?
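One workaround that does reclaim memory on every rank, sketched here only as a suggestion (the run_generation helper and the model path are illustrative, not something from this thread): run the LLM in a child process, so every device's memory is returned to the OS when that process exits.

import multiprocessing as mp

def run_generation(prompts, result_queue):
    # Hypothetical helper: build the LLM, generate, and return plain strings.
    # All GPU memory, on every tensor-parallel rank, is freed when this
    # child process exits.
    from vllm import LLM, SamplingParams
    llm = LLM(
        model="./save_folder",  # placeholder path, matching the log above
        tensor_parallel_size=2,
        distributed_executor_backend="mp",
    )
    outputs = llm.generate(prompts, SamplingParams(temperature=0, max_tokens=128))
    result_queue.put([o.outputs[0].text for o in outputs])

if __name__ == "__main__":
    ctx = mp.get_context("spawn")  # spawn avoids inheriting HIP/CUDA state
    queue = ctx.Queue()
    proc = ctx.Process(target=run_generation, args=(["Hello"], queue))
    proc.start()
    completions = queue.get()  # read before join to avoid blocking on a full queue
    proc.join()                # all GPU memory is released once the child exits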

Thank you very much, and have a nice day!

Best regards,

Shuyue
June 11th, 2024

@hongxiayang
Collaborator

This issue should be closed, as the current main branch supports multi-GPU on ROCm 6.1.x.
