vLLM running on a Ray Cluster Hanging on Initializing #2826
Comments
Hi, let me know if this solves your issue as well :)
Thanks for the idea. I tried it, but it didn't work for me: same hanging issue. However, I went off for dinner and came back to this message:
Which I think is a new error message compared to the thread I linked, but googling didn't give me any great insight into fixing it.
Which is always a little impressive, to be honest...
I am experiencing the same at the moment. For me, it happens with GPTQ quantisation with tp=4. I have tried the following settings / combinations of settings without any luck: NCCL_P2P_DISABLE=1, and the latest vLLM compiled from source. It hangs at approx. 12,995 MB of VRAM on each card across 4x 3090s, with a Llama 2 70B model. It finally hung at this after approx. 1 hour:
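For reference, settings like the one above have to be in the environment before the engine process initializes NCCL. A minimal sketch of applying them in-process (the value here is simply the one tried in this thread, not a confirmed fix; the NCCL_DEBUG line is my own addition for diagnosing where the hang occurs, not something the commenter used):

```python
import os

# Must run before anything initializes CUDA/NCCL (e.g. before importing vLLM);
# exporting these in the shell that launches the process is the safer option.
os.environ["NCCL_P2P_DISABLE"] = "1"  # disable peer-to-peer GPU transfers
os.environ["NCCL_DEBUG"] = "INFO"     # verbose NCCL logs to help spot where it hangs

print(os.environ["NCCL_P2P_DISABLE"])  # prints: 1
```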
I tried the same options as above, and used Ray, but it did not help. What did work was using a GPTQ model; it seems that only AWQ models hang (I only tried those two on multi-GPU).
I have the same issue. Did you find a solution, @ffolkes1911?
I have a similar issue, but it eventually works after about 40 minutes. I have described the details in #2959.
Hey @WoosuKwon. I just cloned the repo, built it, started Ray on two machines, and then initiated vLLM with a tensor parallel size of 4. The result is that vLLM hangs and never moves past "Initializing an LLM engine with config: ...". While that PR no doubt fixed some problem, it doesn't appear to have fixed this one: using a Ray cluster across two different machines results in vLLM hanging and not starting.
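For context, the two-machine setup described above typically looks like the following (a hedged sketch; the head-node address, port, and model name are placeholders, not values taken from this thread):

```shell
# On the head node (placeholder port; 6379 is Ray's default):
ray start --head --port=6379

# On the second machine, pointing at the head node:
ray start --address='HEAD_NODE_IP:6379'

# Then launch from the head node; with a tensor parallel size that spans
# GPUs on both machines, vLLM schedules its workers through Ray.
python -c "
from vllm import LLM
llm = LLM(model='MODEL_NAME', tensor_parallel_size=4)  # placeholder model
print(llm.generate('Hello'))
"
```

In the failure mode reported here, the final step prints the "Initializing an LLM engine with config: ..." line and then never returns.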
Thanks very much. |
Thank you very much, I have fixed the problem. The issue is that I have multiple network cards, so I use the
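The exact setting is cut off in the comment above. On multi-NIC machines, the NCCL and Gloo interface variables are the ones commonly involved — this is an assumption on my part, not something the commenter confirmed, and `eth0` is a placeholder interface name:

```shell
# Pin NCCL (and Gloo) to one interface on a multi-NIC machine.
# "eth0" is a placeholder -- substitute the interface that routes between nodes.
export NCCL_SOCKET_IFNAME=eth0
export GLOO_SOCKET_IFNAME=eth0
echo "$NCCL_SOCKET_IFNAME"
```

These need to be set in the environment of every Ray node before the vLLM workers start.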
Try the `ray stop` command; it worked for me.
How did you fix this? I get the same issue with vLLM version 0.3.3 on 2x A100 cards. Thanks in advance.
It isn't clear what is at fault here, whether it be vLLM or Ray.
There is a thread on the Ray forums that outlines the issue; it is 16 days old and has no replies:
https://discuss.ray.io/t/running-vllm-script-on-multi-node-cluster/13533
Quoting from that thread, since it is identical for me:
I have exactly this same problem. The thread details the other points: that "ray status" seems to show the nodes working and communicating, and that it stays like this for an age, then eventually crashes with some error messages. Everything in that thread is identical to what is happening for me.
Unfortunately, the Ray forums probably don't want to engage because it is vLLM, and I am concerned that vLLM won't want to engage because it is Ray...