
CommunicationTest.SendRecv/UCC hangs. #3120

Open
wujingyue opened this issue Oct 7, 2024 · 5 comments
Labels: bug (Something isn't working), Multidevice

Comments

@wujingyue (Collaborator) commented:
The test runs OK in GitHub CI, which runs on 4x V100 and 4x A100 machines, but hangs consistently on H100.

@csarofeen and I managed to reproduce this on viking-prod-231 in partition viking-prod-pjnl.

$ git rev-parse HEAD
61a77e0a64d5bc446ba1c009f04a19204a28eab2

$ _bn && NVIDIA_TF32_OVERRIDE=0 mpirun -np 4 bin/test_multidevice

Other tests pass with CommunicationTest.SendRecv/UCC excluded.

$ _bn && NVIDIA_TF32_OVERRIDE=0 mpirun -np 4 bin/test_multidevice --gtest_filter=-CommunicationTest.SendRecv/UCC
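Presumably the hang can also be reproduced in isolation by running only the offending test with a positive gtest filter (this exact invocation is my assumption and was not verified in this thread):

$ _bn && NVIDIA_TF32_OVERRIDE=0 mpirun -np 4 bin/test_multidevice --gtest_filter=CommunicationTest.SendRecv/UCC
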
@wujingyue (Collaborator, Author) commented:

@samnordmann would you mind taking a look?

wujingyue added a commit that referenced this issue Oct 7, 2024
@samnordmann (Collaborator) commented Oct 8, 2024:

I am able to reproduce the issue on a viking H100 DGX node and can explain what is going on.

What

There is a known incompatibility between user stream operations and UCX using NVLink over CUDA IPC, which can cause hangs. This is what we are seeing here: both UCX and nvFuser post operations on the stream, and this causes a deadlock.
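As a side check (my assumption, not something done in this thread), ucx_info can confirm whether the cuda_ipc transport is even available on the node, which is a prerequisite for hitting this code path:

$ ucx_info -d | grep -i cuda   # look for cuda_ipc / cuda_copy transport entries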

Temporary workaround

We can disable the use of CUDA IPC in UCX by setting the flags UCX_RNDV_THRESH=0 and UCX_TLS=ib,cuda_copy. With these flags, the command

mpirun -np 4 -x UCX_RNDV_THRESH=0 -x UCX_TLS=ib,cuda_copy bin/test_multidevice --gtest_filter=-CommunicationTest.SendRecv/UCC

executes smoothly.

With those flags, UCX will use GPUDirect RDMA, so we probably need the node to have a capable NIC. GPUDirect RDMA is stream-less, so there is no deadlock issue.
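A quick sanity check for that (again my assumption, not taken from this thread) is to confirm the node exposes an RDMA-capable NIC and has the GPUDirect RDMA kernel module loaded:

$ ibv_devinfo | grep -i hca_id                    # RDMA-capable devices, if any
$ lsmod | grep -E 'nvidia_peermem|nv_peer_mem'    # GPUDirect RDMA peer-memory module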

Long-term fix

The UCX and UCC teams are working on a solution as part of the POR: https://redmine.mellanox.com/issues/3831841

Backtraces, for the record
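
These look like gdb output from attaching to one of the hung ranks. For reference, a minimal way to capture the same thing (the exact commands are my assumption; they were not given in the thread) would be:

$ gdb -p <pid_of_hung_test_multidevice_rank>
(gdb) info threads
(gdb) thread apply all bt
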

Threads involved:

  Id   Target Id                                            Frame
  1    Thread 0x7fd5653fd000 (LWP 197328) "test_multidevic" 0x00007fd4e42a043c in ?? () from /usr/local/cuda/compat/lib.real/libcuda.so.1
  2    Thread 0x7fd4ea7ff000 (LWP 197330) "fuse"            __GI___libc_read (nbytes=271, buf=0x7fd4ea7c8670, fd=4) at ../sysdeps/unix/sysv/linux/read.c:26
  3    Thread 0x7fd4e0361000 (LWP 197332) "cuda00002000009" 0x00007fd5662cfbcf in __GI___poll (fds=0x7fd4dbe01000, nfds=3, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
  4    Thread 0x7fd2ceab3000 (LWP 197334) "cuda-EvtHandlr"  0x00007fd5662cfbcf in __GI___poll (fds=0x7fd2ca608000, nfds=11, timeout=100) at ../sysdeps/unix/sysv/linux/poll.c:29
  5    Thread 0x7fd27addd000 (LWP 197336) "async"           0x00007fd5662dce2e in epoll_wait (epfd=70, events=events@entry=0x7fd27ada7760, maxevents=16, timeout=-1)
    at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
* 6    Thread 0x7fd2431dc000 (LWP 197338) "ucc-progress"    __futex_abstimed_wait_common (cancel=false, private=<optimized out>, abstime=0x0, clockid=0, expected=3,
    futex_word=0x7fd4e0ebee8c) at ./nptl/futex-internal.c:103
  7    Thread 0x7fd23f1db000 (LWP 197340) "test_multidevic" 0x00007fd5662de45f in __libc_accept (fd=179, addr=..., len=0x7fd23f1a57a0) at ../sysdeps/unix/sysv/linux/accept.c:26
  8    Thread 0x7fd23afda000 (LWP 197341) "pt_nccl_watchdg" __futex_abstimed_wait_common64 (private=2134846251, cancel=true, abstime=0x7fd23afa42c0, op=137, expected=0,
    futex_word=0x7fd4e0f11688) at ./nptl/futex-internal.c:57
  9    Thread 0x7fd2369ff000 (LWP 197342) "pt_nccl_heartbt" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x7fd2369c95f0, op=137, expected=0, futex_word=0x7fd4e0f116b8)
    at ./nptl/futex-internal.c:57

Backtrace of the main thread (blocked inside cudaLaunchKernel, reached from nvfuser::Communicator::barrier in the MultiDeviceTest destructor):

#0  0x00007fd4e42a043c in ?? () from /usr/local/cuda/compat/lib.real/libcuda.so.1
#1  0x00007fd4e3f5368c in ?? () from /usr/local/cuda/compat/lib.real/libcuda.so.1
#2  0x00007fd4e429ee48 in ?? () from /usr/local/cuda/compat/lib.real/libcuda.so.1
#3  0x00007fd4e400570f in ?? () from /usr/local/cuda/compat/lib.real/libcuda.so.1
#4  0x00007fd4e3feb57a in ?? () from /usr/local/cuda/compat/lib.real/libcuda.so.1
#5  0x00007fd4e3ff1b4d in ?? () from /usr/local/cuda/compat/lib.real/libcuda.so.1
#6  0x00007fd4e4059574 in ?? () from /usr/local/cuda/compat/lib.real/libcuda.so.1
#7  0x00007fd56723bcc3 in ?? () from /usr/local/cuda/lib64/libcudart.so.12
#8  0x00007fd56723c410 in ?? () from /usr/local/cuda/lib64/libcudart.so.12
#9  0x00007fd56723c47e in ?? () from /usr/local/cuda/lib64/libcudart.so.12
#10 0x00007fd56723f100 in ?? () from /usr/local/cuda/lib64/libcudart.so.12
#11 0x00007fd567215a4e in ?? () from /usr/local/cuda/lib64/libcudart.so.12
#12 0x00007fd567275a73 in cudaLaunchKernel () from /usr/local/cuda/lib64/libcudart.so.12
#13 0x00007fd56905c535 in void at::native::gpu_kernel_impl_nocast<at::native::FillFunctor<float> >(at::TensorIteratorBase&, at::native::FillFunctor<float> const&) ()
   from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so
#14 0x00007fd56904b5cb in at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar const&)::{lambda()#1}::operator()() const ()
   from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so
#15 0x00007fd56904d4f2 in at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar const&) () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so
#16 0x00007fd5945e99d5 in at::native::fill_out(at::Tensor&, c10::Scalar const&) () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so
#17 0x00007fd56aaba5f1 in at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA_Scalar_fill_(at::Tensor&, c10::Scalar const&) ()
   from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so
#18 0x00007fd594e0872d in at::_ops::fill__Scalar::call(at::Tensor&, c10::Scalar const&) () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so
#19 0x00007fd5945e9def in at::native::zero_(at::Tensor&) () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so
#20 0x00007fd56aab9309 in at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA__zero_(at::Tensor&) () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so
#21 0x00007fd597b267cc in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor& (c10::DispatchKeySet, at::Tensor&), &torch::ADInplaceOrView::(anonymous namespace)::zero_>, at::Tensor&, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor&> >, at::Tensor& (c10::DispatchKeySet, at::Tensor&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor&) () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so
#22 0x00007fd5972d1e24 in torch::autograd::VariableType::(anonymous namespace)::zero_(c10::DispatchKeySet, at::Tensor&) ()
   from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so
#23 0x00007fd595312253 in at::_ops::zero_::call(at::Tensor&) () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so
#24 0x00007fd5948ac9b3 in at::native::zeros_symint(c10::ArrayRef<c10::SymInt>, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>) ()
   from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so
#25 0x00007fd5956e6fdb in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::ArrayRef<c10::SymInt>, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeExplicitAutograd__zeros>, at::Tensor, c10::guts::typelist::typelist<c10::ArrayRef<c10::SymInt>, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool> > >, at::Tensor (c10::ArrayRef<c10::SymInt>, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, c10::ArrayRef<c10::SymInt>, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>) ()
   from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so
#26 0x00007fd594dbd7a9 in at::_ops::zeros::redispatch(c10::DispatchKeySet, c10::ArrayRef<c10::SymInt>, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>) () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so
#27 0x00007fd59551f564 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::ArrayRef<c10::SymInt>, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>), &at::(anonymous namespace)::zeros>, at::Tensor, c10::guts::typelist::typelist<c10::ArrayRef<c10::SymInt>, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool> > >, at::Tensor (c10::ArrayRef<c10::SymInt>, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, c10::ArrayRef<c10::SymInt>, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>) () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so
#28 0x00007fd594e2048f in at::_ops::zeros::call(c10::ArrayRef<c10::SymInt>, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>) ()
   from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so
#29 0x00007fd568728002 in c10d::ProcessGroupNCCL::barrier(c10d::BarrierOptions const&) () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so
#30 0x00005626042121cc in nvfuser::Communicator::barrier (this=0x7fd4e0b2e460, backend=std::optional<nvfuser::CommunicatorBackend> [no contained value])
    at /opt/pytorch/Fuser_local/csrc/multidevice/communicator.cpp:308
#31 0x000056260462593a in nvfuser::MultiDeviceTest::~MultiDeviceTest (this=0x7fd4e0ddba20, __in_chrg=<optimized out>) at /opt/pytorch/Fuser_local/tests/cpp/multidevice.cpp:87
#32 0x000056260464c775 in nvfuser::CommunicationTest::~CommunicationTest (this=0x7fd4e0ddba20, __in_chrg=<optimized out>)
    at /opt/pytorch/Fuser_local/tests/cpp/test_multidevice_communications.cpp:23
#33 0x000056260465488f in nvfuser::CommunicationTest_SendRecv_Test::~CommunicationTest_SendRecv_Test (this=0x7fd4e0ddba20, __in_chrg=<optimized out>)
    at /opt/pytorch/Fuser_local/tests/cpp/test_multidevice_communications.cpp:208
#34 0x00005626046548b8 in nvfuser::CommunicationTest_SendRecv_Test::~CommunicationTest_SendRecv_Test (this=0x7fd4e0ddba20, __in_chrg=<optimized out>)
    at /opt/pytorch/Fuser_local/tests/cpp/test_multidevice_communications.cpp:208
#35 0x000056260478e2e2 in testing::Test::DeleteSelf_ (this=0x7fd4e0ddba20) at /opt/pytorch/Fuser_local/third_party/googletest/googletest/include/gtest/gtest.h:336
#36 0x000056260479e66d in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void> (object=0x7fd4e0ddba20,
    method=(void (testing::Test::*)(testing::Test * const)) 0x56260478e2b4 <testing::Test::DeleteSelf_()>, location=0x562604c306df "the test fixture's destructor")
    at /opt/pytorch/Fuser_local/third_party/googletest/googletest/src/gtest.cc:2612
#37 0x0000562604797e05 in testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void> (object=0x7fd4e0ddba20,
    method=(void (testing::Test::*)(testing::Test * const)) 0x56260478e2b4 <testing::Test::DeleteSelf_()>, location=0x562604c306df "the test fixture's destructor")
    at /opt/pytorch/Fuser_local/third_party/googletest/googletest/src/gtest.cc:2648
#38 0x0000562604774121 in testing::TestInfo::Run (this=0x7fd4e3d1c8c0) at /opt/pytorch/Fuser_local/third_party/googletest/googletest/src/gtest.cc:2842
#39 0x0000562604774acd in testing::TestSuite::Run (this=0x7fd4e3c9b9c0) at /opt/pytorch/Fuser_local/third_party/googletest/googletest/src/gtest.cc:3015
#40 0x0000562604784ff4 in testing::internal::UnitTestImpl::RunAllTests (this=0x7fd4e0d4e300) at /opt/pytorch/Fuser_local/third_party/googletest/googletest/src/gtest.cc:5920
#41 0x000056260479f608 in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (object=0x7fd4e0d4e300,
    method=(bool (testing::internal::UnitTestImpl::*)(testing::internal::UnitTestImpl * const)) 0x562604784bda <testing::internal::UnitTestImpl::RunAllTests()>,
    location=0x562604c30fa0 "auxiliary test code (environments or event listeners)") at /opt/pytorch/Fuser_local/third_party/googletest/googletest/src/gtest.cc:2612
#42 0x0000562604798f09 in testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (object=0x7fd4e0d4e300,
    method=(bool (testing::internal::UnitTestImpl::*)(testing::internal::UnitTestImpl * const)) 0x562604784bda <testing::internal::UnitTestImpl::RunAllTests()>,
    location=0x562604c30fa0 "auxiliary test code (environments or event listeners)") at /opt/pytorch/Fuser_local/third_party/googletest/googletest/src/gtest.cc:2648
#43 0x00005626047835f5 in testing::UnitTest::Run (this=0x562605203800 <testing::UnitTest::GetInstance()::instance>)
    at /opt/pytorch/Fuser_local/third_party/googletest/googletest/src/gtest.cc:5484
#44 0x0000562604627dcf in RUN_ALL_TESTS () at /opt/pytorch/Fuser_local/third_party/googletest/googletest/include/gtest/gtest.h:2317
#45 0x000056260462611c in main (argc=1, argv=0x7ffeb8a47258) at /opt/pytorch/Fuser_local/tests/cpp/multidevice.cpp:161

Backtrace of the ucc-progress thread (blocked on a CUDA driver lock inside uct_cuda_ipc_iface_init_streams while progressing a UCC collective over cuda_ipc):

#0  __futex_abstimed_wait_common (cancel=false, private=<optimized out>, abstime=0x0, clockid=0, expected=3, futex_word=0x7fd4e0ebee8c) at ./nptl/futex-internal.c:103
#1  __GI___futex_abstimed_wait64 (futex_word=futex_word@entry=0x7fd4e0ebee8c, expected=expected@entry=3, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=<optimized out>)
    at ./nptl/futex-internal.c:128
#2  0x00007fd56625224f in __pthread_rwlock_wrlock_full64 (abstime=0x0, clockid=0, rwlock=0x7fd4e0ebee80) at ./nptl/pthread_rwlock_common.c:730
#3  ___pthread_rwlock_wrlock (rwlock=0x7fd4e0ebee80) at ./nptl/pthread_rwlock_wrlock.c:26
#4  0x00007fd4e42a0fd4 in ?? () from /usr/local/cuda/compat/lib.real/libcuda.so.1
#5  0x00007fd4e3f14c4e in ?? () from /usr/local/cuda/compat/lib.real/libcuda.so.1
#6  0x00007fd4e407340c in ?? () from /usr/local/cuda/compat/lib.real/libcuda.so.1
#7  0x00007fd5100533bd in uct_cuda_ipc_iface_init_streams (iface=iface@entry=0x7fd2a1dd0000) at cuda_ipc/cuda_ipc_iface.c:400
#8  0x00007fd510053a2e in uct_cuda_ipc_post_cuda_async_copy (direction=0, comp=0x7fd25cf32d90, rkey=140540909625456, iov=0x7fd2431a61f0, remote_addr=140479922438144, tl_ep=<optimized out>)
    at cuda_ipc/cuda_ipc_ep.c:100
#9  uct_cuda_ipc_ep_put_zcopy (tl_ep=<optimized out>, iov=0x7fd2431a61f0, iovcnt=<optimized out>, remote_addr=140479922438144, rkey=140540909625456, comp=0x7fd25cf32d90)
    at cuda_ipc/cuda_ipc_ep.c:178
#10 0x00007fd55b58f437 in uct_ep_put_zcopy (comp=0x7fd25cf32d90, rkey=<optimized out>, remote_addr=<optimized out>, iovcnt=1, iov=0x7fd2431a61f0, ep=<optimized out>)
    at /build-result/src/hpcx-v2.20-gcc-inbox-ubuntu22.04-cuda12-x86_64/ucx-39c8f9b/src/uct/api/uct.h:2915
#11 ucp_proto_rndv_put_common_send (comp=0x7fd25cf32d90, iov=0x7fd2431a61f0, lpriv=<optimized out>, req=0x7fd25cf32d00) at rndv/rndv_put.c:59
#12 ucp_proto_rndv_put_zcopy_send_func (lane_shift=<synthetic pointer>, next_iter=<synthetic pointer>, lpriv=<optimized out>, req=0x7fd25cf32d00) at rndv/rndv_put.c:363
#13 ucp_proto_multi_progress (dt_mask=1, complete_func=<optimized out>, send_func=<optimized out>, mpriv=0x7fd23aff63a0, req=0x7fd25cf32d00)
    at /build-result/src/hpcx-v2.20-gcc-inbox-ubuntu22.04-cuda12-x86_64/ucx-39c8f9b/src/ucp/proto/proto_multi.inl:177
#14 ucp_proto_multi_zcopy_progress (uct_comp_cb=<optimized out>, complete_func=<optimized out>, send_func=<optimized out>, dt_mask=1, uct_mem_flags=256, init_func=<optimized out>,
    mpriv=0x7fd23aff63a0, req=0x7fd25cf32d00) at /build-result/src/hpcx-v2.20-gcc-inbox-ubuntu22.04-cuda12-x86_64/ucx-39c8f9b/src/ucp/proto/proto_multi.inl:246
#15 ucp_proto_rndv_put_zcopy_send_progress (uct_req=0x7fd25cf32dd8) at rndv/rndv_put.c:373
#16 0x00007fd55b5887eb in ucp_request_try_send (req=0x7fd25cf32d00) at /build-result/src/hpcx-v2.20-gcc-inbox-ubuntu22.04-cuda12-x86_64/ucx-39c8f9b/src/ucp/core/ucp_request.inl:307
#17 ucp_request_send (req=<optimized out>) at /build-result/src/hpcx-v2.20-gcc-inbox-ubuntu22.04-cuda12-x86_64/ucx-39c8f9b/src/ucp/core/ucp_request.inl:330
#18 ucp_proto_rndv_send_start (worker=<optimized out>, op_attr_mask=<optimized out>, rtr=<optimized out>, header_length=<optimized out>, sg_count=<optimized out>, req=<optimized out>)
    at rndv/proto_rndv.c:845
#19 ucp_proto_rndv_send_start (worker=<optimized out>, req=0x7fd25cf32d00, op_attr_mask=<optimized out>, rtr=<optimized out>, header_length=<optimized out>, sg_count=<optimized out>)
    at rndv/proto_rndv.c:820
#20 0x00007fd55b5889a1 in ucp_proto_rndv_handle_rtr (arg=0x7fd275772100, data=0x7fd2c4feeac0, length=<optimized out>, flags=<optimized out>) at rndv/proto_rndv.c:902
#21 0x00007fd566175eb9 in uct_iface_invoke_am (flags=1, length=<optimized out>, data=0x7fd2c4feeac0, id=<optimized out>, iface=0x7fd2c5952200)
    at /build-result/src/hpcx-v2.20-gcc-inbox-ubuntu22.04-cuda12-x86_64/ucx-39c8f9b/src/uct/base/uct_iface.h:942
#22 uct_mm_iface_invoke_am (flags=1, length=<optimized out>, data=0x7fd2c4feeac0, am_id=<optimized out>, iface=0x7fd2c5952200) at sm/mm/base/mm_iface.h:278
#23 uct_mm_iface_process_recv (iface=0x7fd2c5952200) at sm/mm/base/mm_iface.c:321
#24 uct_mm_iface_poll_fifo (iface=0x7fd2c5952200) at sm/mm/base/mm_iface.c:353
#25 uct_mm_iface_progress (tl_iface=0x7fd2c5952200) at sm/mm/base/mm_iface.c:406
#26 0x00007fd55b56564a in ucs_callbackq_dispatch (cbq=<optimized out>) at /build-result/src/hpcx-v2.20-gcc-inbox-ubuntu22.04-cuda12-x86_64/ucx-39c8f9b/src/ucs/datastruct/callbackq.h:215
#27 uct_worker_progress (worker=<optimized out>) at /build-result/src/hpcx-v2.20-gcc-inbox-ubuntu22.04-cuda12-x86_64/ucx-39c8f9b/src/uct/api/uct.h:2787
#28 ucp_worker_progress (worker=0x7fd275772100) at core/ucp_worker.c:2996
#29 0x00007fd4ee742d81 in ucc_tl_ucp_test (task=0x7fd4db8d69c0) at bcast/../tl_ucp_coll.h:399
#30 ucc_tl_ucp_bcast_knomial_progress (coll_task=0x7fd4db8d69c0) at bcast/bcast_knomial.c:39
#31 0x00007fd5674d941e in ucc_pq_mt_progress (pq=0x7fd25cf16440) at core/ucc_progress_queue_mt.c:78
#32 0x00007fd5674d343d in ucc_progress_queue (pq=<optimized out>) at core/ucc_progress_queue.h:48
#33 ucc_context_progress (context=0x7fd2c53c5e80) at core/ucc_context.c:988
#34 0x00007fd56877aff3 in c10d::CommUCC::progress() () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so
#35 0x00007fd56876b5cd in c10d::Comm::progress_loop() () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so
#36 0x00007fd5664bc253 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#37 0x00007fd56624bac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#38 0x00007fd5662dd850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

@samnordmann (Collaborator) commented:

@xwang233 I created a PR here but I am not able to request your review. I might have done something wrong; let me know:

https://gitlab-master.nvidia.com/dl/pytorch/fuser-gh-mirror/-/merge_requests/13

@xwang233 (Collaborator) commented Oct 8, 2024:

> @xwang233 I created a PR here but I am not managing to request your review. I might have done something wrong, let me know
>
> https://gitlab-master.nvidia.com/dl/pytorch/fuser-gh-mirror/-/merge_requests/13

LGTM. Thanks for the reminder. Feel free to cc me internally on MRs in the future. 😄

@samnordmann (Collaborator) commented:

POR ticket for the long-term fix: https://redmine.mellanox.com/issues/3831841
