[Bug]: Can't run vllm distributed inference with vLLM + Ray #5094

Closed · linchen111 opened this issue May 29, 2024 · 9 comments
Labels: bug (Something isn't working)
Your current environment


Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.3
Libc version: glibc-2.35

Python version: 3.10.14 (main, May  6 2024, 19:42:50) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-107-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.4.99
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 2080 Ti
GPU 1: NVIDIA GeForce RTX 2080 Ti
GPU 2: NVIDIA GeForce RTX 2080 Ti
GPU 3: NVIDIA GeForce RTX 2080 Ti
GPU 4: NVIDIA GeForce RTX 2080 Ti
GPU 5: NVIDIA GeForce RTX 2080 Ti
GPU 6: NVIDIA GeForce RTX 2080 Ti
GPU 7: NVIDIA GeForce RTX 2080 Ti

Nvidia driver version: 550.54.14
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      46 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             56
On-line CPU(s) list:                0-55
Vendor ID:                          GenuineIntel
Model name:                         Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
CPU family:                         6
Model:                              79
Thread(s) per core:                 2
Core(s) per socket:                 14
Socket(s):                          2
Stepping:                           1
CPU max MHz:                        3300.0000
CPU min MHz:                        1200.0000
BogoMIPS:                           4799.97
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts md_clear flush_l1d
L1d cache:                          896 KiB (28 instances)
L1i cache:                          896 KiB (28 instances)
L2 cache:                           7 MiB (28 instances)
L3 cache:                           70 MiB (2 instances)
NUMA node(s):                       2
NUMA node0 CPU(s):                  0-13,28-41
NUMA node1 CPU(s):                  14-27,42-55
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        KVM: Mitigation: VMX unsupported
Vulnerability L1tf:                 Mitigation; PTE Inversion
Vulnerability Mds:                  Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown:             Mitigation; PTI
Vulnerability Mmio stale data:      Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP conditional; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Mitigation; Clear CPU buffers; SMT vulnerable

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.0
[pip3] triton==2.3.0
[pip3] vllm_nccl_cu12==2.18.1.0.4.0
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] nvidia-nccl-cu12          2.20.5                   pypi_0    pypi
[conda] torch                     2.3.0                    pypi_0    pypi
[conda] triton                    2.3.0                    pypi_0    pypi
[conda] vllm-nccl-cu12            2.18.1.0.4.0             pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.2
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PIX     PHB     PHB     SYS     SYS     SYS     SYS     PHB     0-13,28-41      0               N/A
GPU1    PIX      X      PHB     PHB     SYS     SYS     SYS     SYS     PHB     0-13,28-41      0               N/A
GPU2    PHB     PHB      X      PIX     SYS     SYS     SYS     SYS     PHB     0-13,28-41      0               N/A
GPU3    PHB     PHB     PIX      X      SYS     SYS     SYS     SYS     PHB     0-13,28-41      0               N/A
GPU4    SYS     SYS     SYS     SYS      X      PIX     PHB     PHB     SYS     14-27,42-55     1               N/A
GPU5    SYS     SYS     SYS     SYS     PIX      X      PHB     PHB     SYS     14-27,42-55     1               N/A
GPU6    SYS     SYS     SYS     SYS     PHB     PHB      X      PIX     SYS     14-27,42-55     1               N/A
GPU7    SYS     SYS     SYS     SYS     PHB     PHB     PIX      X      SYS     14-27,42-55     1               N/A
NIC0    PHB     PHB     PHB     PHB     SYS     SYS     SYS     SYS      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: rocep1s0

🐛 Describe the bug

I have two machines, each equipped with eight 2080 Ti (22 GB) GPUs. Following the official tutorial, I ran ray start --head on the master node and ray start --address='xxx.xxx.xxx.xxx:6379' on the other node.

I ran ray status to check, and here is the output:

(vllm042) root@ubuntu:~# ray status
======== Autoscaler status: 2024-05-29 04:15:40.185279 ========
Node status
---------------------------------------------------------------
Active:
 1 node_44ba686d06daa1e59b426d84054a70f867ae791fc1ac8e1ca9b89395
 1 node_ad1ec224eb7ec94e75064d9463a6b129a16ade62c047833e53ccdce7
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/112.0 CPU
 16.0/16.0 GPU (16.0 used of 16.0 reserved in placement groups)
 0B/163.40GiB memory
 0B/74.02GiB object_store_memory

Demands:
 (no resource demands)

However, when I run the following code:

from vllm import LLM
llm = LLM(model="/root/data_ssd/c4ai-command-r-plus",
          tensor_parallel_size=16, dtype="float16",
          gpu_memory_utilization=0.9,
          worker_use_ray=True,
          max_model_len=6000,
          enforce_eager=True)

I receive the following error:

WARNING 05-29 04:07:29 config.py:405] Possibly too large swap space. 64.00 GiB out of the 125.75 GiB total CPU memory is allocated for the swap space.
2024-05-29 04:07:29,877 INFO worker.py:1564 -- Connecting to existing Ray cluster at address: 192.168.100.21:6379...
2024-05-29 04:07:29,884 INFO worker.py:1749 -- Connected to Ray cluster.
INFO 05-29 04:07:30 llm_engine.py:100] Initializing an LLM engine (v0.4.2) with config: model='/root/data_ssd/c4ai-command-r-plus', speculative_config=None, tokenizer='/root/data_ssd/c4ai-command-r-plus', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=6000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=16, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/root/data_ssd/c4ai-command-r-plus)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 05-29 04:08:10 utils.py:660] Found nccl from library /root/.config/vllm/nccl/cu12/libnccl.so.2.18.1
(RayWorkerWrapper pid=6565, ip=192.168.100.22) INFO 05-29 04:08:10 utils.py:660] Found nccl from library /root/.config/vllm/nccl/cu12/libnccl.so.2.18.1
INFO 05-29 04:08:16 selector.py:69] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 05-29 04:08:16 selector.py:32] Using XFormers backend.
(RayWorkerWrapper pid=17750) INFO 05-29 04:08:16 selector.py:69] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(RayWorkerWrapper pid=17750) INFO 05-29 04:08:16 selector.py:32] Using XFormers backend.
(RayWorkerWrapper pid=18175) INFO 05-29 04:08:10 utils.py:660] Found nccl from library /root/.config/vllm/nccl/cu12/libnccl.so.2.18.1 [repeated 14x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(RayWorkerWrapper pid=6634, ip=192.168.100.22) ERROR 05-29 04:08:22 worker_base.py:145] Error executing method init_device. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=6634, ip=192.168.100.22) ERROR 05-29 04:08:22 worker_base.py:145] Traceback (most recent call last):
(RayWorkerWrapper pid=6634, ip=192.168.100.22) ERROR 05-29 04:08:22 worker_base.py:145] File "/root/miniconda3/envs/vllm042/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 137, in execute_method
(RayWorkerWrapper pid=6634, ip=192.168.100.22) ERROR 05-29 04:08:22 worker_base.py:145] return executor(*args, **kwargs)
(RayWorkerWrapper pid=6634, ip=192.168.100.22) ERROR 05-29 04:08:22 worker_base.py:145] File "/root/miniconda3/envs/vllm042/lib/python3.10/site-packages/vllm/worker/worker.py", line 111, in init_device
(RayWorkerWrapper pid=6634, ip=192.168.100.22) ERROR 05-29 04:08:22 worker_base.py:145] init_worker_distributed_environment(self.parallel_config, self.rank,
(RayWorkerWrapper pid=6634, ip=192.168.100.22) ERROR 05-29 04:08:22 worker_base.py:145] File "/root/miniconda3/envs/vllm042/lib/python3.10/site-packages/vllm/worker/worker.py", line 288, in init_worker_distributed_environment
(RayWorkerWrapper pid=6634, ip=192.168.100.22) ERROR 05-29 04:08:22 worker_base.py:145] init_distributed_environment(parallel_config.world_size, rank,
(RayWorkerWrapper pid=6634, ip=192.168.100.22) ERROR 05-29 04:08:22 worker_base.py:145] File "/root/miniconda3/envs/vllm042/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 78, in init_distributed_environment
(RayWorkerWrapper pid=6634, ip=192.168.100.22) ERROR 05-29 04:08:22 worker_base.py:145] _CPU_WORLD_GROUP = torch.distributed.new_group(ranks=ranks,
(RayWorkerWrapper pid=6634, ip=192.168.100.22) ERROR 05-29 04:08:22 worker_base.py:145] File "/root/miniconda3/envs/vllm042/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 89, in wrapper
(RayWorkerWrapper pid=6634, ip=192.168.100.22) ERROR 05-29 04:08:22 worker_base.py:145] func_return = func(*args, **kwargs)
(RayWorkerWrapper pid=6634, ip=192.168.100.22) ERROR 05-29 04:08:22 worker_base.py:145] File "/root/miniconda3/envs/vllm042/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3868, in new_group
(RayWorkerWrapper pid=6634, ip=192.168.100.22) ERROR 05-29 04:08:22 worker_base.py:145] return _new_group_with_tag(ranks, timeout, backend, pg_options, None, use_local_synchronization=use_local_synchronization)
(RayWorkerWrapper pid=6634, ip=192.168.100.22) ERROR 05-29 04:08:22 worker_base.py:145] File "/root/miniconda3/envs/vllm042/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3939, in _new_group_with_tag
(RayWorkerWrapper pid=6634, ip=192.168.100.22) ERROR 05-29 04:08:22 worker_base.py:145] pg, pg_store = _new_process_group_helper(
(RayWorkerWrapper pid=6634, ip=192.168.100.22) ERROR 05-29 04:08:22 worker_base.py:145] File "/root/miniconda3/envs/vllm042/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1509, in _new_process_group_helper
(RayWorkerWrapper pid=6634, ip=192.168.100.22) ERROR 05-29 04:08:22 worker_base.py:145] backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
...
ERROR 05-29 04:08:22 worker_base.py:145] pg, pg_store = _new_process_group_helper(
ERROR 05-29 04:08:22 worker_base.py:145] File "/root/miniconda3/envs/vllm042/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1509, in _new_process_group_helper
ERROR 05-29 04:08:22 worker_base.py:145] backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
ERROR 05-29 04:08:22 worker_base.py:145] RuntimeError: Gloo connectFullMesh failed with [../third_party/gloo/gloo/transport/tcp/pair.cc:144] no error

linchen111 added the bug label on May 29, 2024
andoorve (Collaborator)

Try setting GLOO_SOCKET_IFNAME?

linchen111 (Author) commented Jun 3, 2024

Try setting GLOO_SOCKET_IFNAME?

My ifconfig output looks like this:

eth2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.100.21  netmask 255.255.255.0  broadcast 192.168.100.255
        inet6 fe80::a236:9fff:fe80:6890  prefixlen 64  scopeid 0x20<link>
        ether a0:36:9f:80:68:90  txqueuelen 1000  (Ethernet)
        RX packets 4477  bytes 1317080 (1.3 MB)
        RX errors 0  dropped 6  overruns 0  frame 0
        TX packets 4520  bytes 3066153 (3.0 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eth4: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.10.1.1  netmask 255.255.255.0  broadcast 192.10.1.255
        inet6 fe80::526b:4bff:fe51:2640  prefixlen 64  scopeid 0x20<link>
        ether 50:6b:4b:51:26:40  txqueuelen 1000  (Ethernet)
        RX packets 166  bytes 10550 (10.5 KB)
        RX errors 0  dropped 147  overruns 0  frame 0
        TX packets 14  bytes 1056 (1.0 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 5763  bytes 3492496 (3.4 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 5763  bytes 3492496 (3.4 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

And here is some information about eth2 and eth4:


(vllm042) root@ubuntu:~# ethtool -i eth4
driver: mlx4_en
version: 4.0-0
firmware-version: 2.41.7308
expansion-rom-version:
bus-info: 0000:01:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes
(vllm042) root@ubuntu:~# ethtool -i eth2
driver: ixgbe
version: 5.15.0-107-generic
firmware-version: 0x8000059e, 17.0.12
expansion-rom-version:
bus-info: 0000:82:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

So, should I set os.environ['GLOO_SOCKET_IFNAME'] = 'eth4'?

andoorve (Collaborator) commented Jun 3, 2024

Not sure in this case; you could potentially try both. It depends on your HW setup.

linchen111 (Author)

Not sure in this case; you could potentially try both. It depends on your HW setup.

Any instructions about the HW setup?

andoorve (Collaborator) commented Jun 6, 2024

No, I don't have much information. I was using V100 nodes from https://org.nebius.ai/, where I needed to run export GLOO_SOCKET_IFNAME=eth0 based on the interfaces available on my instances.
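
A minimal sketch of applying that suggestion before the engine is constructed (assuming eth2, the 192.168.100.x interface from the ifconfig output above, is the one that should carry cross-node traffic; the interface name is an assumption and must match your own nodes):

import os

# Assumption: eth2 (the 192.168.100.x NIC in the ifconfig output above) is the
# interface that routes traffic between the two nodes; substitute your own.
os.environ["GLOO_SOCKET_IFNAME"] = "eth2"
os.environ["NCCL_SOCKET_IFNAME"] = "eth2"

from vllm import LLM

# Same constructor call as in the bug report above.
llm = LLM(model="/root/data_ssd/c4ai-command-r-plus",
          tensor_parallel_size=16, dtype="float16",
          gpu_memory_utilization=0.9,
          worker_use_ray=True,
          max_model_len=6000,
          enforce_eager=True)

Note that the Ray worker processes on the second node inherit their environment from ray start rather than from this driver script, so exporting GLOO_SOCKET_IFNAME (and, if needed, NCCL_SOCKET_IFNAME) in the shell on both nodes before running ray start should be the safer route.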

linchen111 (Author)

No, I don't have much information. I was using V100 nodes from https://org.nebius.ai/, where I needed to run export GLOO_SOCKET_IFNAME=eth0 based on the interfaces available on my instances.

Thanks

SuperBruceJia commented Jun 10, 2024

I also encountered this problem. Is there any solution? Thank you guys very much!
@andoorve @linchen111 @tmm1 @markmc @zhouyuan @WoosukKwon @youkaichao
It is directly related to these issues: #3455 and #2466.

/usr4/ec523/brucejia/.local/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py:468: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.
  warnings.warn(
Downloading shards: 100%|██████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 1285.22it/s]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████| 4/4 [00:03<00:00,  1.20it/s]
/usr4/ec523/brucejia/.local/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py:769: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.
  warnings.warn(
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Trainable params: 8,030,261,248 | All params: 8,030,261,248 | Trainable%: 100.00%
Successfully save the tokenizer!
Successfully save the model!


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-06-09 21:03:54,034 INFO worker.py:1753 -- Started a local Ray instance.
INFO 06-09 21:03:54 llm_engine.py:161] Initializing an LLM engine (v0.4.3) with config: model='./save_folder', speculative_config=None, tokenizer='meta-llama/Meta-Llama-3-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=./save_folder)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
(pid=3543508) /usr4/ec523/brucejia/.local/lib/python3.10/site-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
(pid=3543508)   warnings.warn(
(RayWorkerWrapper pid=3543610) ERROR 06-09 21:03:59 worker_base.py:148] Error executing method init_device. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=3543610) ERROR 06-09 21:03:59 worker_base.py:148] Traceback (most recent call last):
(RayWorkerWrapper pid=3543610) ERROR 06-09 21:03:59 worker_base.py:148]   File "/usr4/ec523/brucejia/.local/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 140, in execute_method
(RayWorkerWrapper pid=3543610) ERROR 06-09 21:03:59 worker_base.py:148]     return executor(*args, **kwargs)
(RayWorkerWrapper pid=3543610) ERROR 06-09 21:03:59 worker_base.py:148]   File "/usr4/ec523/brucejia/.local/lib/python3.10/site-packages/vllm/worker/worker.py", line 105, in init_device
(RayWorkerWrapper pid=3543610) ERROR 06-09 21:03:59 worker_base.py:148]     torch.cuda.set_device(self.device)
(RayWorkerWrapper pid=3543610) ERROR 06-09 21:03:59 worker_base.py:148]   File "/usr4/ec523/brucejia/.local/lib/python3.10/site-packages/torch/cuda/__init__.py", line 399, in set_device
(RayWorkerWrapper pid=3543610) ERROR 06-09 21:03:59 worker_base.py:148]     torch._C._cuda_setDevice(device)
(RayWorkerWrapper pid=3543610) ERROR 06-09 21:03:59 worker_base.py:148] RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
(RayWorkerWrapper pid=3543610) ERROR 06-09 21:03:59 worker_base.py:148] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(RayWorkerWrapper pid=3543610) ERROR 06-09 21:03:59 worker_base.py:148] For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
(RayWorkerWrapper pid=3543610) ERROR 06-09 21:03:59 worker_base.py:148] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(RayWorkerWrapper pid=3543610) ERROR 06-09 21:03:59 worker_base.py:148] 

Best regards,

Shuyue
June 9th, 2024

warlockedward

I've encountered the same problem on version v0.5.0.

SuperBruceJia

I've encountered the same problem on version v0.5.0.

Please check this solution: #2794 (comment)

It works on my side. The only remaining problem is that the memory of the distributed GPUs cannot be released, unfortunately; I am still working on that.

Best regards,

Shuyue
June 13th, 2024
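
As a generic teardown sketch for the memory-release issue above (assuming the engine object is named llm and Ray is used for tensor parallelism; this is only the usual cleanup sequence, not a confirmed fix for the leak described in the comment):

import gc

import ray
import torch

# Drop the engine object and let Python collect it first.
del llm
gc.collect()

# Release cached CUDA memory held by the driver process.
torch.cuda.empty_cache()

# Shut down the Ray workers that still hold GPU memory on the other ranks.
ray.shutdown()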
