[Bug]: Can't run vllm distributed inference with vLLM + Ray #5094
Comments
Try setting GLOO_SOCKET_IFNAME to the network interface your nodes use to reach each other.
My ifconfig output looks like this:
and here is some information about eth2 and eth4:
So, should I do os.environ['GLOO_SOCKET_IFNAME'] = 'eth4'?
Not sure in this case; you could try both. It depends on your HW setup.
Any instructions about the HW setup?
No, I don't have much information. I was using V100 nodes from https://org.nebius.ai/, where I needed to do:
Thanks
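For anyone following along, here is a minimal sketch of the suggestion above: pin Gloo to a specific interface via the environment variable before initializing vLLM. The variable name and the eth2/eth4 candidates come from this thread; the model name and tensor_parallel_size below are placeholders, not the reporter's actual script.

```python
import os

# Pin Gloo's TCP transport to a specific NIC before vLLM sets up its
# distributed process groups. Try eth2 or eth4, whichever interface actually
# carries traffic between the nodes (check ifconfig on each machine).
os.environ["GLOO_SOCKET_IFNAME"] = "eth4"  # placeholder interface name

from vllm import LLM

# Placeholder model and parallelism settings, only for illustration.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", tensor_parallel_size=2)
```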
I also encountered this problem. Is there any solution? Thank you guys very much!
/usr4/ec523/brucejia/.local/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py:468: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.
warnings.warn(
Downloading shards: 100%|██████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 1285.22it/s]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████| 4/4 [00:03<00:00, 1.20it/s]
/usr4/ec523/brucejia/.local/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py:769: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.
warnings.warn(
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Trainable params: 8,030,261,248 | All params: 8,030,261,248 | Trainable%: 100.00%
Successfully save the tokenizer!
Successfully save the model!
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-06-09 21:03:54,034 INFO worker.py:1753 -- Started a local Ray instance.
INFO 06-09 21:03:54 llm_engine.py:161] Initializing an LLM engine (v0.4.3) with config: model='./save_folder', speculative_config=None, tokenizer='meta-llama/Meta-Llama-3-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=./save_folder)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
(pid=3543508) /usr4/ec523/brucejia/.local/lib/python3.10/site-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
(pid=3543508) warnings.warn(
(RayWorkerWrapper pid=3543610) ERROR 06-09 21:03:59 worker_base.py:148] Error executing method init_device. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=3543610) ERROR 06-09 21:03:59 worker_base.py:148] Traceback (most recent call last):
(RayWorkerWrapper pid=3543610) ERROR 06-09 21:03:59 worker_base.py:148] File "/usr4/ec523/brucejia/.local/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 140, in execute_method
(RayWorkerWrapper pid=3543610) ERROR 06-09 21:03:59 worker_base.py:148] return executor(*args, **kwargs)
(RayWorkerWrapper pid=3543610) ERROR 06-09 21:03:59 worker_base.py:148] File "/usr4/ec523/brucejia/.local/lib/python3.10/site-packages/vllm/worker/worker.py", line 105, in init_device
(RayWorkerWrapper pid=3543610) ERROR 06-09 21:03:59 worker_base.py:148] torch.cuda.set_device(self.device)
(RayWorkerWrapper pid=3543610) ERROR 06-09 21:03:59 worker_base.py:148] File "/usr4/ec523/brucejia/.local/lib/python3.10/site-packages/torch/cuda/__init__.py", line 399, in set_device
(RayWorkerWrapper pid=3543610) ERROR 06-09 21:03:59 worker_base.py:148] torch._C._cuda_setDevice(device)
(RayWorkerWrapper pid=3543610) ERROR 06-09 21:03:59 worker_base.py:148] RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
(RayWorkerWrapper pid=3543610) ERROR 06-09 21:03:59 worker_base.py:148] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(RayWorkerWrapper pid=3543610) ERROR 06-09 21:03:59 worker_base.py:148] For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
(RayWorkerWrapper pid=3543610) ERROR 06-09 21:03:59 worker_base.py:148] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(RayWorkerWrapper pid=3543610) ERROR 06-09 21:03:59 worker_base.py:148]
Best regards, Shuyue
I've encountered the same problem on v0.5.0.
Please check this solution: #2794 (comment). It works on my side. The only problem is that the memory of the distributed GPUs cannot be released, unfortunately; I am still working on this. Best regards, Shuyue
Your current environment
🐛 Describe the bug
I have two machines, each equipped with eight 2080 Ti (22 GB) GPUs. Following the official tutorial, I ran ray start --head on the master node and ray start --address='xxx.xxx.xxx.xxx:6379' on the other node.
I ran ray status to check, and here is the output:
However, when I run the following code:
I receive the following error:
WARNING 05-29 04:07:29 config.py:405] Possibly too large swap space. 64.00 GiB out of the 125.75 GiB total CPU memory is allocated for the swap space.
2024-05-29 04:07:29,877 INFO worker.py:1564 -- Connecting to existing Ray cluster at address: 192.168.100.21:6379...
2024-05-29 04:07:29,884 INFO worker.py:1749 -- Connected to Ray cluster.
INFO 05-29 04:07:30 llm_engine.py:100] Initializing an LLM engine (v0.4.2) with config: model='/root/data_ssd/c4ai-command-r-plus', speculative_config=None, tokenizer='/root/data_ssd/c4ai-command-r-plus', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=6000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=16, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/root/data_ssd/c4ai-command-r-plus)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 05-29 04:08:10 utils.py:660] Found nccl from library /root/.config/vllm/nccl/cu12/libnccl.so.2.18.1
(RayWorkerWrapper pid=6565, ip=192.168.100.22) INFO 05-29 04:08:10 utils.py:660] Found nccl from library /root/.config/vllm/nccl/cu12/libnccl.so.2.18.1
INFO 05-29 04:08:16 selector.py:69] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 05-29 04:08:16 selector.py:32] Using XFormers backend.
(RayWorkerWrapper pid=17750) INFO 05-29 04:08:16 selector.py:69] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(RayWorkerWrapper pid=17750) INFO 05-29 04:08:16 selector.py:32] Using XFormers backend.
(RayWorkerWrapper pid=18175) INFO 05-29 04:08:10 utils.py:660] Found nccl from library /root/.config/vllm/nccl/cu12/libnccl.so.2.18.1 [repeated 14x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(RayWorkerWrapper pid=6634, ip=192.168.100.22) ERROR 05-29 04:08:22 worker_base.py:145] Error executing method init_device. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=6634, ip=192.168.100.22) ERROR 05-29 04:08:22 worker_base.py:145] Traceback (most recent call last):
(RayWorkerWrapper pid=6634, ip=192.168.100.22) ERROR 05-29 04:08:22 worker_base.py:145] File "/root/miniconda3/envs/vllm042/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 137, in execute_method
(RayWorkerWrapper pid=6634, ip=192.168.100.22) ERROR 05-29 04:08:22 worker_base.py:145] return executor(*args, **kwargs)
(RayWorkerWrapper pid=6634, ip=192.168.100.22) ERROR 05-29 04:08:22 worker_base.py:145] File "/root/miniconda3/envs/vllm042/lib/python3.10/site-packages/vllm/worker/worker.py", line 111, in init_device
(RayWorkerWrapper pid=6634, ip=192.168.100.22) ERROR 05-29 04:08:22 worker_base.py:145] init_worker_distributed_environment(self.parallel_config, self.rank,
(RayWorkerWrapper pid=6634, ip=192.168.100.22) ERROR 05-29 04:08:22 worker_base.py:145] File "/root/miniconda3/envs/vllm042/lib/python3.10/site-packages/vllm/worker/worker.py", line 288, in init_worker_distributed_environment
(RayWorkerWrapper pid=6634, ip=192.168.100.22) ERROR 05-29 04:08:22 worker_base.py:145] init_distributed_environment(parallel_config.world_size, rank,
(RayWorkerWrapper pid=6634, ip=192.168.100.22) ERROR 05-29 04:08:22 worker_base.py:145] File "/root/miniconda3/envs/vllm042/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 78, in init_distributed_environment
(RayWorkerWrapper pid=6634, ip=192.168.100.22) ERROR 05-29 04:08:22 worker_base.py:145] _CPU_WORLD_GROUP = torch.distributed.new_group(ranks=ranks,
(RayWorkerWrapper pid=6634, ip=192.168.100.22) ERROR 05-29 04:08:22 worker_base.py:145] File "/root/miniconda3/envs/vllm042/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 89, in wrapper
(RayWorkerWrapper pid=6634, ip=192.168.100.22) ERROR 05-29 04:08:22 worker_base.py:145] func_return = func(*args, **kwargs)
(RayWorkerWrapper pid=6634, ip=192.168.100.22) ERROR 05-29 04:08:22 worker_base.py:145] File "/root/miniconda3/envs/vllm042/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3868, in new_group
(RayWorkerWrapper pid=6634, ip=192.168.100.22) ERROR 05-29 04:08:22 worker_base.py:145] return _new_group_with_tag(ranks, timeout, backend, pg_options, None, use_local_synchronization=use_local_synchronization)
(RayWorkerWrapper pid=6634, ip=192.168.100.22) ERROR 05-29 04:08:22 worker_base.py:145] File "/root/miniconda3/envs/vllm042/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3939, in _new_group_with_tag
(RayWorkerWrapper pid=6634, ip=192.168.100.22) ERROR 05-29 04:08:22 worker_base.py:145] pg, pg_store = _new_process_group_helper(
(RayWorkerWrapper pid=6634, ip=192.168.100.22) ERROR 05-29 04:08:22 worker_base.py:145] File "/root/miniconda3/envs/vllm042/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1509, in _new_process_group_helper
(RayWorkerWrapper pid=6634, ip=192.168.100.22) ERROR 05-29 04:08:22 worker_base.py:145] backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
...
ERROR 05-29 04:08:22 worker_base.py:145] pg, pg_store = _new_process_group_helper(
ERROR 05-29 04:08:22 worker_base.py:145] File "/root/miniconda3/envs/vllm042/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1509, in _new_process_group_helper
ERROR 05-29 04:08:22 worker_base.py:145] backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
ERROR 05-29 04:08:22 worker_base.py:145] RuntimeError: Gloo connectFullMesh failed with [../third_party/gloo/gloo/transport/tcp/pair.cc:144] no error
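For context, a rough sketch of how the interface-pinning workaround discussed in the comments above might be applied to this two-node setup. The engine arguments are reconstructed from the config line in the log (they may not match the original script), and propagating the variable to the Ray workers via a job-level runtime_env is an assumption, not a confirmed fix.

```python
import os
import ray
from vllm import LLM

# Assumption: make GLOO_SOCKET_IFNAME visible to every Ray worker process by
# connecting to the existing cluster with a job-level runtime_env.
ray.init(
    address="auto",
    runtime_env={"env_vars": {"GLOO_SOCKET_IFNAME": "eth4"}},  # placeholder NIC
)
os.environ["GLOO_SOCKET_IFNAME"] = "eth4"  # also set it in the driver process

# Arguments mirror the engine config logged above; treat them as a sketch.
llm = LLM(
    model="/root/data_ssd/c4ai-command-r-plus",
    tensor_parallel_size=16,
    enforce_eager=True,
    max_model_len=6000,
    dtype="float16",
)
```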