
Add back the UCC backend for Bcast_sharded/PipelineTestTwoStages tests. #3124

Open
wujingyue opened this issue Oct 7, 2024 · 13 comments

@wujingyue

wujingyue commented Oct 7, 2024

These tests were disabled by #2794 and should be fixed so they can be re-enabled.

To reproduce:

$ git revert -c 90260eff23372029e58656ea614b8eaab211ac5e
$ _bn && mpirun -np 4 bin/test_multidevice --gtest_filter=Bcast_sharded/PipelineTestTwoStages.Communication/20

The symptom appears to be non-deterministic. Sometimes the test hangs, sometimes it segfaults.

I ran into this on viking-prod-231. I'm unsure whether it's machine- or GPU-dependent.
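
Not part of the original repro, but one way to get a symbolized backtrace when the crash reproduces is to run each rank under gdb in batch mode; the command below assumes the same alias, binary, and test filter as above.

$ _bn && mpirun -np 4 \
    gdb -batch -ex run -ex "thread apply all bt" --args \
    bin/test_multidevice --gtest_filter=Bcast_sharded/PipelineTestTwoStages.Communication/20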

@samnordmann

samnordmann commented Oct 8, 2024

The issue /usr/local/ucx/lib/ucx/libuct_cuda_gdrcopy.so.0: undefined symbol: gdr_get_info_v2 reported here is due to a UCX bug, and that bug is why we disabled the tests in #2794. We have filed an issue, and the fix is in progress: https://redmine.mellanox.com/issues/4088373. I am still able to reproduce this bug on viking-prod-237 + pjnl-latest.

However, I am not able to reproduce the "hang" you mention on the setup described above.
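
Side note (not verified in this thread): an undefined-symbol error like this typically means the libgdrcopy picked up at load time is too old to export gdr_get_info_v2. A quick way to check, with illustrative paths, plus a possible temporary workaround that excludes the gdr_copy transport:

$ nm -D /usr/local/ucx/lib/ucx/libuct_cuda_gdrcopy.so.0 | grep gdr_get_info   # the unresolved reference
$ ldconfig -p | grep gdrcopy                                                  # which libgdrcopy.so the loader finds
$ UCX_TLS=^gdr_copy mpirun -np 4 bin/test_multidevice --gtest_filter=Bcast_sharded/PipelineTestTwoStages.Communication/20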

@wujingyue

It's interesting that we're seeing different symptoms. Anyhow, FWIW, the repro in the OP gives me the following on viking-prod-229 + pjnl-latest.

Note: Google Test filter = Bcast_sharded/PipelineTestTwoStages.Communication/20
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from Bcast_sharded/PipelineTestTwoStages
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/20
Note: Google Test filter = Bcast_sharded/PipelineTestTwoStages.Communication/20
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from Bcast_sharded/PipelineTestTwoStages
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/20
Note: Google Test filter = Bcast_sharded/PipelineTestTwoStages.Communication/20
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from Bcast_sharded/PipelineTestTwoStages
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/20
Note: Google Test filter = Bcast_sharded/PipelineTestTwoStages.Communication/20
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from Bcast_sharded/PipelineTestTwoStages
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/20
[viking-prod-229:2096 :0:2096] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x8)
==== backtrace (tid:   2096) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x00000000000aebbd ucp_address_unpack()  /build-result/src/hpcx-v2.20-gcc-inbox-ubuntu22.04-cuda12-x86_64/ucx-39c8f9b/src/ucp/wireup/address.c:1646
 2 0x00000000000aebbd ucp_address_unpack()  /build-result/src/hpcx-v2.20-gcc-inbox-ubuntu22.04-cuda12-x86_64/ucx-39c8f9b/src/ucp/wireup/address.c:1648
 3 0x000000000003f1c9 ucp_ep_create_api_to_worker_addr()  /build-result/src/hpcx-v2.20-gcc-inbox-ubuntu22.04-cuda12-x86_64/ucx-39c8f9b/src/ucp/core/ucp_ep.c:1056
 4 0x000000000003f1c9 ucp_ep_create()  /build-result/src/hpcx-v2.20-gcc-inbox-ubuntu22.04-cuda12-x86_64/ucx-39c8f9b/src/ucp/core/ucp_ep.c:1195
 5 0x000000000000ff37 ucc_tl_ucp_connect_ep()  /build-result/src/hpcx-v2.20-gcc-inbox-ubuntu22.04-cuda12-x86_64/ucc-a0c139fe1e91b28681018a196e53510044322530/src/components/tl/ucp/tl_ucp_ep.c:40
 6 0x000000000000ff37 ucc_tl_ucp_connect_team_ep()  /build-result/src/hpcx-v2.20-gcc-inbox-ubuntu22.04-cuda12-x86_64/ucc-a0c139fe1e91b28681018a196e53510044322530/src/components/tl/ucp/tl_ucp_ep.c:62
 7 0x000000000002e342 ucc_tl_ucp_get_ep()  /build-result/src/hpcx-v2.20-gcc-inbox-ubuntu22.04-cuda12-x86_64/ucc-a0c139fe1e91b28681018a196e53510044322530/src/components/tl/ucp/./tl_ucp_ep.h:77
 8 0x000000000002e342 ucc_tl_ucp_send_common()  /build-result/src/hpcx-v2.20-gcc-inbox-ubuntu22.04-cuda12-x86_64/ucc-a0c139fe1e91b28681018a196e53510044322530/src/components/tl/ucp/./tl_ucp_sendrecv.h:79
 9 0x000000000002e342 ucc_tl_ucp_send_nb()  /build-result/src/hpcx-v2.20-gcc-inbox-ubuntu22.04-cuda12-x86_64/ucc-a0c139fe1e91b28681018a196e53510044322530/src/components/tl/ucp/./tl_ucp_sendrecv.h:104
10 0x000000000002efa0 ucc_tl_ucp_allreduce_knomial_progress()  /build-result/src/hpcx-v2.20-gcc-inbox-ubuntu22.04-cuda12-x86_64/ucc-a0c139fe1e91b28681018a196e53510044322530/src/components/tl/ucp/allreduce/allreduce_knomial.c:103
11 0x000000000002d7b9 ucc_progress_queue_enqueue()  /build-result/src/hpcx-v2.20-gcc-inbox-ubuntu22.04-cuda12-x86_64/ucc-a0c139fe1e91b28681018a196e53510044322530/src/core/ucc_progress_queue.h:35
12 0x000000000002d7b9 ucc_tl_ucp_allreduce_knomial_start()  /build-result/src/hpcx-v2.20-gcc-inbox-ubuntu22.04-cuda12-x86_64/ucc-a0c139fe1e91b28681018a196e53510044322530/src/components/tl/ucp/allreduce/allreduce_knomial.c:211
13 0x0000000000011330 ucc_tl_ucp_service_allreduce()  /build-result/src/hpcx-v2.20-gcc-inbox-ubuntu22.04-cuda12-x86_64/ucc-a0c139fe1e91b28681018a196e53510044322530/src/components/tl/ucp/tl_ucp_service_coll.c:125
14 0x0000000000013b73 ucc_service_allreduce()  /build-result/src/hpcx-v2.20-gcc-inbox-ubuntu22.04-cuda12-x86_64/ucc-a0c139fe1e91b28681018a196e53510044322530/src/core/ucc_service_coll.c:67
15 0x0000000000010461 ucc_team_alloc_id()  /build-result/src/hpcx-v2.20-gcc-inbox-ubuntu22.04-cuda12-x86_64/ucc-a0c139fe1e91b28681018a196e53510044322530/src/core/ucc_team.c:608
16 0x0000000000010461 ucc_team_create_test_single()  /build-result/src/hpcx-v2.20-gcc-inbox-ubuntu22.04-cuda12-x86_64/ucc-a0c139fe1e91b28681018a196e53510044322530/src/core/ucc_team.c:443
17 0x0000000001173780 c10d::Comm::ucc_create_team()  :0
18 0x000000000117a551 c10d::ProcessGroupUCC::initComm()  ???:0
19 0x000000000118b57c c10d::ProcessGroupUCC::recv()  ???:0
20 0x00000000005523ef nvfuser::postSingleCommunication()  :0
21 0x00000000003ac568 nvfuser::hir::HostIrExecutor::handle()  ???:0
22 0x00000000003abf70 nvfuser::hir::HostIrExecutor::runWithInput()  :0
23 0x000000000055f540 nvfuser::MultiDeviceExecutor::runWithInput()  :0
24 0x000000000082c02e nvfuser::PipelineTest::executeAndValidate()  ???:0
25 0x000000000082d73d nvfuser::PipelineTestTwoStages_Communication_Test::TestBody()  ???:0
26 0x00000000008bfc41 testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>()  :0
27 0x00000000008ab8d5 testing::Test::Run()  gtest-all.cc:0
28 0x00000000008ac062 testing::TestInfo::Run()  ???:0
29 0x00000000008acbeb testing::TestSuite::Run()  gtest-all.cc:0
30 0x00000000008b5244 testing::internal::UnitTestImpl::RunAllTests()  ???:0
31 0x00000000008ac245 testing::UnitTest::Run()  ???:0
32 0x000000000013ae9c main()  ???:0
33 0x0000000000029d90 __libc_init_first()  ???:0
34 0x0000000000029e40 __libc_start_main()  ???:0
35 0x0000000000140015 _start()  ???:0
=================================
[W1008 10:10:16.315006275 TCPStore.cpp:141] [c10d] recvValue failed on SocketImpl(fd=6, addr=[localhost]:55712, remote=[localhost]:29542): Connection reset by peer
Exception raised from recvBytes at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/Utils.hpp:667 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xac (0x7f553eca13dc in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x59bded2 (0x7f55337dced2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x59bf0e4 (0x7f55337de0e4 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x59c02d0 (0x7f55337df2d0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) + 0x249 (0x7f55337d98c9 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::PrefixStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) + 0x3d (0x7f553377bbbd in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x119a267 (0x7f55036e3267 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: ucc_core_addr_exchange + 0x3c (0x7f550251c99c in /opt/hpcx/ucc/lib/cmake/ucc/../../../lib/libucc.so.1)
frame #8: ucc_context_create_proc_info + 0x853 (0x7f550251d5e3 in /opt/hpcx/ucc/lib/cmake/ucc/../../../lib/libucc.so.1)
frame #9: <unknown function> + 0x119d0ae (0x7f55036e60ae in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #10: <unknown function> + 0x1178987 (0x7f55036c1987 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #11: <unknown function> + 0x1179734 (0x7f55036c2734 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #12: c10d::ProcessGroupUCC::initComm(c10::Device) + 0x28a (0x7f55036c340a in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #13: c10d::ProcessGroupUCC::send(std::vector<at::Tensor, std::allocator<at::Tensor> >&, int, int) + 0x5c (0x7f55036d342c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #14: <unknown function> + 0x5523ef (0x5558185a73ef in bin/test_multidevice)
frame #15: <unknown function> + 0x3ac568 (0x555818401568 in bin/test_multidevice)
frame #16: <unknown function> + 0x3abf70 (0x555818400f70 in bin/test_multidevice)
frame #17: <unknown function> + 0x55f540 (0x5558185b4540 in bin/test_multidevice)
frame #18: <unknown function> + 0x82c02e (0x55581888102e in bin/test_multidevice)
frame #19: <unknown function> + 0x82d73d (0x55581888273d in bin/test_multidevice)
frame #20: <unknown function> + 0x8bfc41 (0x555818914c41 in bin/test_multidevice)
frame #21: <unknown function> + 0x8ab8d5 (0x5558189008d5 in bin/test_multidevice)
frame #22: <unknown function> + 0x8ac062 (0x555818901062 in bin/test_multidevice)
frame #23: <unknown function> + 0x8acbeb (0x555818901beb in bin/test_multidevice)
frame #24: <unknown function> + 0x8b5244 (0x55581890a244 in bin/test_multidevice)
frame #25: <unknown function> + 0x8ac245 (0x555818901245 in bin/test_multidevice)
frame #26: <unknown function> + 0x13ae9c (0x55581818fe9c in bin/test_multidevice)
frame #27: <unknown function> + 0x29d90 (0x7f55011e0d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #28: __libc_start_main + 0x80 (0x7f55011e0e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #29: <unknown function> + 0x140015 (0x555818195015 in bin/test_multidevice)

[E1008 10:10:16.319658610 UCCUtils.cpp:63] (oob_allgather) Caught exception in Store Operation .. [Connection reset by peer
Exception raised from recvBytes at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/Utils.hpp:667 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xac (0x7f553eca13dc in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x59bded2 (0x7f55337dced2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x59bf0e4 (0x7f55337de0e4 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x59c02d0 (0x7f55337df2d0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) + 0x249 (0x7f55337d98c9 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::PrefixStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) + 0x3d (0x7f553377bbbd in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x119a267 (0x7f55036e3267 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: ucc_core_addr_exchange + 0x3c (0x7f550251c99c in /opt/hpcx/ucc/lib/cmake/ucc/../../../lib/libucc.so.1)
frame #8: ucc_context_create_proc_info + 0x853 (0x7f550251d5e3 in /opt/hpcx/ucc/lib/cmake/ucc/../../../lib/libucc.so.1)
frame #9: <unknown function> + 0x119d0ae (0x7f55036e60ae in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #10: <unknown function> + 0x1178987 (0x7f55036c1987 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #11: <unknown function> + 0x1179734 (0x7f55036c2734 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #12: c10d::ProcessGroupUCC::initComm(c10::Device) + 0x28a (0x7f55036c340a in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #13: c10d::ProcessGroupUCC::send(std::vector<at::Tensor, std::allocator<at::Tensor> >&, int, int) + 0x5c (0x7f55036d342c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #14: <unknown function> + 0x5523ef (0x5558185a73ef in bin/test_multidevice)
frame #15: <unknown function> + 0x3ac568 (0x555818401568 in bin/test_multidevice)
frame #16: <unknown function> + 0x3abf70 (0x555818400f70 in bin/test_multidevice)
frame #17: <unknown function> + 0x55f540 (0x5558185b4540 in bin/test_multidevice)
frame #18: <unknown function> + 0x82c02e (0x55581888102e in bin/test_multidevice)
frame #19: <unknown function> + 0x82d73d (0x55581888273d in bin/test_multidevice)
frame #20: <unknown function> + 0x8bfc41 (0x555818914c41 in bin/test_multidevice)
frame #21: <unknown function> + 0x8ab8d5 (0x5558189008d5 in bin/test_multidevice)
frame #22: <unknown function> + 0x8ac062 (0x555818901062 in bin/test_multidevice)
frame #23: <unknown function> + 0x8acbeb (0x555818901beb in bin/test_multidevice)
frame #24: <unknown function> + 0x8b5244 (0x55581890a244 in bin/test_multidevice)
frame #25: <unknown function> + 0x8ac245 (0x555818901245 in bin/test_multidevice)
frame #26: <unknown function> + 0x13ae9c (0x55581818fe9c in bin/test_multidevice)
frame #27: <unknown function> + 0x29d90 (0x7f55011e0d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #28: __libc_start_main + 0x80 (0x7f55011e0e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #29: <unknown function> + 0x140015 (0x555818195015 in bin/test_multidevice)
]
[W1008 10:10:16.319696018 TCPStore.cpp:122] [c10d] sendBytes failed on SocketImpl(fd=6, addr=[localhost]:55712, remote=[localhost]:29542): Broken pipe
Exception raised from sendBytes at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/Utils.hpp:645 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xac (0x7f553eca13dc in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x59bded2 (0x7f55337dced2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x59be333 (0x7f55337dd333 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x59c0784 (0x7f55337df784 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::incrementValueBy(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long) + 0x155 (0x7f55337d9445 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::TCPStore::add(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long) + 0xb4 (0x7f55337d9574 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::PrefixStore::add(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long) + 0x49 (0x7f553377b6e9 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0x119acc1 (0x7f55036e3cc1 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #8: ucc_core_addr_exchange + 0x2be (0x7f550251cc1e in /opt/hpcx/ucc/lib/cmake/ucc/../../../lib/libucc.so.1)
frame #9: ucc_context_create_proc_info + 0x853 (0x7f550251d5e3 in /opt/hpcx/ucc/lib/cmake/ucc/../../../lib/libucc.so.1)
frame #10: <unknown function> + 0x119d0ae (0x7f55036e60ae in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #11: <unknown function> + 0x1178987 (0x7f55036c1987 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #12: <unknown function> + 0x1179734 (0x7f55036c2734 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #13: c10d::ProcessGroupUCC::initComm(c10::Device) + 0x28a (0x7f55036c340a in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #14: c10d::ProcessGroupUCC::send(std::vector<at::Tensor, std::allocator<at::Tensor> >&, int, int) + 0x5c (0x7f55036d342c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #15: <unknown function> + 0x5523ef (0x5558185a73ef in bin/test_multidevice)
frame #16: <unknown function> + 0x3ac568 (0x555818401568 in bin/test_multidevice)
frame #17: <unknown function> + 0x3abf70 (0x555818400f70 in bin/test_multidevice)
frame #18: <unknown function> + 0x55f540 (0x5558185b4540 in bin/test_multidevice)
frame #19: <unknown function> + 0x82c02e (0x55581888102e in bin/test_multidevice)
frame #20: <unknown function> + 0x82d73d (0x55581888273d in bin/test_multidevice)
frame #21: <unknown function> + 0x8bfc41 (0x555818914c41 in bin/test_multidevice)
frame #22: <unknown function> + 0x8ab8d5 (0x5558189008d5 in bin/test_multidevice)
frame #23: <unknown function> + 0x8ac062 (0x555818901062 in bin/test_multidevice)
frame #24: <unknown function> + 0x8acbeb (0x555818901beb in bin/test_multidevice)
frame #25: <unknown function> + 0x8b5244 (0x55581890a244 in bin/test_multidevice)
frame #26: <unknown function> + 0x8ac245 (0x555818901245 in bin/test_multidevice)
frame #27: <unknown function> + 0x13ae9c (0x55581818fe9c in bin/test_multidevice)
frame #28: <unknown function> + 0x29d90 (0x7f55011e0d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #29: __libc_start_main + 0x80 (0x7f55011e0e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #30: <unknown function> + 0x140015 (0x555818195015 in bin/test_multidevice)

[E1008 10:10:16.324580826 UCCUtils.cpp:93] (oob_allgather) Caught exception in Store Operation .. [Broken pipe
Exception raised from sendBytes at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/Utils.hpp:645 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xac (0x7f553eca13dc in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x59bded2 (0x7f55337dced2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x59be333 (0x7f55337dd333 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x59c0784 (0x7f55337df784 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::incrementValueBy(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long) + 0x155 (0x7f55337d9445 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::TCPStore::add(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long) + 0xb4 (0x7f55337d9574 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::PrefixStore::add(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long) + 0x49 (0x7f553377b6e9 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0x119acc1 (0x7f55036e3cc1 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #8: ucc_core_addr_exchange + 0x2be (0x7f550251cc1e in /opt/hpcx/ucc/lib/cmake/ucc/../../../lib/libucc.so.1)
frame #9: ucc_context_create_proc_info + 0x853 (0x7f550251d5e3 in /opt/hpcx/ucc/lib/cmake/ucc/../../../lib/libucc.so.1)
frame #10: <unknown function> + 0x119d0ae (0x7f55036e60ae in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #11: <unknown function> + 0x1178987 (0x7f55036c1987 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #12: <unknown function> + 0x1179734 (0x7f55036c2734 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #13: c10d::ProcessGroupUCC::initComm(c10::Device) + 0x28a (0x7f55036c340a in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #14: c10d::ProcessGroupUCC::send(std::vector<at::Tensor, std::allocator<at::Tensor> >&, int, int) + 0x5c (0x7f55036d342c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #15: <unknown function> + 0x5523ef (0x5558185a73ef in bin/test_multidevice)
frame #16: <unknown function> + 0x3ac568 (0x555818401568 in bin/test_multidevice)
frame #17: <unknown function> + 0x3abf70 (0x555818400f70 in bin/test_multidevice)
frame #18: <unknown function> + 0x55f540 (0x5558185b4540 in bin/test_multidevice)
frame #19: <unknown function> + 0x82c02e (0x55581888102e in bin/test_multidevice)
frame #20: <unknown function> + 0x82d73d (0x55581888273d in bin/test_multidevice)
frame #21: <unknown function> + 0x8bfc41 (0x555818914c41 in bin/test_multidevice)
frame #22: <unknown function> + 0x8ab8d5 (0x5558189008d5 in bin/test_multidevice)
frame #23: <unknown function> + 0x8ac062 (0x555818901062 in bin/test_multidevice)
frame #24: <unknown function> + 0x8acbeb (0x555818901beb in bin/test_multidevice)
frame #25: <unknown function> + 0x8b5244 (0x55581890a244 in bin/test_multidevice)
frame #26: <unknown function> + 0x8ac245 (0x555818901245 in bin/test_multidevice)
frame #27: <unknown function> + 0x13ae9c (0x55581818fe9c in bin/test_multidevice)
frame #28: <unknown function> + 0x29d90 (0x7f55011e0d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #29: __libc_start_main + 0x80 (0x7f55011e0e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #30: <unknown function> + 0x140015 (0x555818195015 in bin/test_multidevice)
]
[1728407416.717817] [viking-prod-229:2098 :0]     ucc_context.c:464  UCC  ERROR oob req test failed during team addr exchange
[1728407416.717832] [viking-prod-229:2098 :0]     ucc_context.c:726  UCC  ERROR failed to exchange addresses during context creation
[E1008 10:10:16.324648470 UCCUtils.cpp:169] [Rank 0][ProcessGroupUCC-0][INIT][ERROR] UCC failed to create UCC context: Unhandled error
unknown file: Failure
C++ exception with description "Unhandled error" thrown in the test body.

To reproduce: NVFUSER_TEST_RANDOM_SEED=1728407416 NVFUSER_TEST_ATEN_RANDOM_SEED=0 nvfuser_tests --gtest_filter='Bcast_sharded/PipelineTestTwoStages.Communication/20'
[W1008 10:10:16.324899359 TCPStore.cpp:122] [c10d] sendBytes failed on SocketImpl(fd=6, addr=[localhost]:55712, remote=[localhost]:29542): Broken pipe
Exception raised from sendBytes at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/Utils.hpp:645 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xac (0x7f553eca13dc in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x59bded2 (0x7f55337dced2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x59be333 (0x7f55337dd333 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x59c0784 (0x7f55337df784 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::deleteKey(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x171 (0x7f55337dc6d1 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::PrefixStore::deleteKey(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x41 (0x7f553377b781 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x1178d81 (0x7f55036c1d81 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: c10d::ProcessGroupUCC::initComm(c10::Device) + 0x28a (0x7f55036c340a in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #8: c10d::ProcessGroupUCC::barrier(c10d::BarrierOptions const&) + 0x76 (0x7f55036cb646 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #9: <unknown function> + 0x55c289 (0x5558185b1289 in bin/test_multidevice)
frame #10: <unknown function> + 0x7e114f (0x55581883614f in bin/test_multidevice)
frame #11: <unknown function> + 0x836137 (0x55581888b137 in bin/test_multidevice)
frame #12: <unknown function> + 0x8bfc41 (0x555818914c41 in bin/test_multidevice)
frame #13: <unknown function> + 0x8abecc (0x555818900ecc in bin/test_multidevice)
frame #14: <unknown function> + 0x8acbeb (0x555818901beb in bin/test_multidevice)
frame #15: <unknown function> + 0x8b5244 (0x55581890a244 in bin/test_multidevice)
frame #16: <unknown function> + 0x8ac245 (0x555818901245 in bin/test_multidevice)
frame #17: <unknown function> + 0x13ae9c (0x55581818fe9c in bin/test_multidevice)
frame #18: <unknown function> + 0x29d90 (0x7f55011e0d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #19: __libc_start_main + 0x80 (0x7f55011e0e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #20: <unknown function> + 0x140015 (0x555818195015 in bin/test_multidevice)

terminate called after throwing an instance of 'c10::DistNetworkError'
  what():  Broken pipe
Exception raised from sendBytes at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/Utils.hpp:645 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xac (0x7f553eca13dc in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x59bded2 (0x7f55337dced2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x59be333 (0x7f55337dd333 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x59c0784 (0x7f55337df784 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::deleteKey(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x171 (0x7f55337dc6d1 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::PrefixStore::deleteKey(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x41 (0x7f553377b781 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x1178d81 (0x7f55036c1d81 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: c10d::ProcessGroupUCC::initComm(c10::Device) + 0x28a (0x7f55036c340a in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #8: c10d::ProcessGroupUCC::barrier(c10d::BarrierOptions const&) + 0x76 (0x7f55036cb646 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #9: <unknown function> + 0x55c289 (0x5558185b1289 in bin/test_multidevice)
frame #10: <unknown function> + 0x7e114f (0x55581883614f in bin/test_multidevice)
frame #11: <unknown function> + 0x836137 (0x55581888b137 in bin/test_multidevice)
frame #12: <unknown function> + 0x8bfc41 (0x555818914c41 in bin/test_multidevice)
frame #13: <unknown function> + 0x8abecc (0x555818900ecc in bin/test_multidevice)
frame #14: <unknown function> + 0x8acbeb (0x555818901beb in bin/test_multidevice)
frame #15: <unknown function> + 0x8b5244 (0x55581890a244 in bin/test_multidevice)
frame #16: <unknown function> + 0x8ac245 (0x555818901245 in bin/test_multidevice)
frame #17: <unknown function> + 0x13ae9c (0x55581818fe9c in bin/test_multidevice)
frame #18: <unknown function> + 0x29d90 (0x7f55011e0d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #19: __libc_start_main + 0x80 (0x7f55011e0e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #20: <unknown function> + 0x140015 (0x555818195015 in bin/test_multidevice)

To reproduce: NVFUSER_TEST_RANDOM_SEED=1728407416 NVFUSER_TEST_ATEN_RANDOM_SEED=0 nvfuser_tests --gtest_filter='Bcast_sharded/PipelineTestTwoStages.Communication/20'
unknown file: Failure
C++ exception with description "failed to recv, got 0 bytes
Exception raised from recvBytes at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xac (0x7f7adccc33dc in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x59bded2 (0x7f7ad1934ed2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x59bf255 (0x7f7ad1936255 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #3: c10d::TCPStore::doWait(c10::ArrayRef<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x2fb (0x7f7ad193223b in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::doGet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x47 (0x7f7ad19325e7 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::TCPStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xb5 (0x7f7ad1933885 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::PrefixStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x43 (0x7f7ad18d3653 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0x1178dfd (0x7f7aa1819dfd in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #8: c10d::ProcessGroupUCC::initComm(c10::Device) + 0x28a (0x7f7aa181b40a in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #9: c10d::ProcessGroupUCC::send(std::vector<at::Tensor, std::allocator<at::Tensor> >&, int, int) + 0x5c (0x7f7aa182b42c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #10: <unknown function> + 0x5523ef (0x55c7db59f3ef in bin/test_multidevice)
frame #11: <unknown function> + 0x3ac568 (0x55c7db3f9568 in bin/test_multidevice)
frame #12: <unknown function> + 0x3abf70 (0x55c7db3f8f70 in bin/test_multidevice)
frame #13: <unknown function> + 0x55f540 (0x55c7db5ac540 in bin/test_multidevice)
frame #14: <unknown function> + 0x82c02e (0x55c7db87902e in bin/test_multidevice)
frame #15: <unknown function> + 0x82d73d (0x55c7db87a73d in bin/test_multidevice)
frame #16: <unknown function> + 0x8bfc41 (0x55c7db90cc41 in bin/test_multidevice)
frame #17: <unknown function> + 0x8ab8d5 (0x55c7db8f88d5 in bin/test_multidevice)
frame #18: <unknown function> + 0x8ac062 (0x55c7db8f9062 in bin/test_multidevice)
frame #19: <unknown function> + 0x8acbeb (0x55c7db8f9beb in bin/test_multidevice)
frame #20: <unknown function> + 0x8b5244 (0x55c7db902244 in bin/test_multidevice)
frame #21: <unknown function> + 0x8ac245 (0x55c7db8f9245 in bin/test_multidevice)
frame #22: <unknown function> + 0x13ae9c (0x55c7db187e9c in bin/test_multidevice)
frame #23: <unknown function> + 0x29d90 (0x7f7a9f1e0d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #24: __libc_start_main + 0x80 (0x7f7a9f1e0e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #25: <unknown function> + 0x140015 (0x55c7db18d015 in bin/test_multidevice)
" thrown in the test body.

[W1008 10:10:16.541670302 TCPStore.cpp:141] [c10d] recvValue failed on SocketImpl(fd=5, addr=[localhost]:55696, remote=[localhost]:29542): failed to recv, got 0 bytes
Exception raised from recvBytes at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xac (0x7f7adccc33dc in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x59bded2 (0x7f7ad1934ed2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x59bee2d (0x7f7ad1935e2d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x59c0460 (0x7f7ad1937460 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::deleteKey(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x17f (0x7f7ad19346df in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::PrefixStore::deleteKey(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x41 (0x7f7ad18d3781 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x1178d81 (0x7f7aa1819d81 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: c10d::ProcessGroupUCC::initComm(c10::Device) + 0x28a (0x7f7aa181b40a in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #8: c10d::ProcessGroupUCC::barrier(c10d::BarrierOptions const&) + 0x76 (0x7f7aa1823646 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #9: <unknown function> + 0x55c289 (0x55c7db5a9289 in bin/test_multidevice)
frame #10: <unknown function> + 0x7e114f (0x55c7db82e14f in bin/test_multidevice)
frame #11: <unknown function> + 0x836137 (0x55c7db883137 in bin/test_multidevice)
frame #12: <unknown function> + 0x8bfc41 (0x55c7db90cc41 in bin/test_multidevice)
frame #13: <unknown function> + 0x8abecc (0x55c7db8f8ecc in bin/test_multidevice)
frame #14: <unknown function> + 0x8acbeb (0x55c7db8f9beb in bin/test_multidevice)
frame #15: <unknown function> + 0x8b5244 (0x55c7db902244 in bin/test_multidevice)
frame #16: <unknown function> + 0x8ac245 (0x55c7db8f9245 in bin/test_multidevice)
frame #17: <unknown function> + 0x13ae9c (0x55c7db187e9c in bin/test_multidevice)
frame #18: <unknown function> + 0x29d90 (0x7f7a9f1e0d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #19: __libc_start_main + 0x80 (0x7f7a9f1e0e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #20: <unknown function> + 0x140015 (0x55c7db18d015 in bin/test_multidevice)

terminate called after throwing an instance of 'c10::DistNetworkError'
  what():  failed to recv, got 0 bytes
Exception raised from recvBytes at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xac (0x7f7adccc33dc in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x59bded2 (0x7f7ad1934ed2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x59bee2d (0x7f7ad1935e2d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x59c0460 (0x7f7ad1937460 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::deleteKey(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x17f (0x7f7ad19346df in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::PrefixStore::deleteKey(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x41 (0x7f7ad18d3781 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x1178d81 (0x7f7aa1819d81 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: c10d::ProcessGroupUCC::initComm(c10::Device) + 0x28a (0x7f7aa181b40a in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #8: c10d::ProcessGroupUCC::barrier(c10d::BarrierOptions const&) + 0x76 (0x7f7aa1823646 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #9: <unknown function> + 0x55c289 (0x55c7db5a9289 in bin/test_multidevice)
frame #10: <unknown function> + 0x7e114f (0x55c7db82e14f in bin/test_multidevice)
frame #11: <unknown function> + 0x836137 (0x55c7db883137 in bin/test_multidevice)
frame #12: <unknown function> + 0x8bfc41 (0x55c7db90cc41 in bin/test_multidevice)
frame #13: <unknown function> + 0x8abecc (0x55c7db8f8ecc in bin/test_multidevice)
frame #14: <unknown function> + 0x8acbeb (0x55c7db8f9beb in bin/test_multidevice)
frame #15: <unknown function> + 0x8b5244 (0x55c7db902244 in bin/test_multidevice)
frame #16: <unknown function> + 0x8ac245 (0x55c7db8f9245 in bin/test_multidevice)
frame #17: <unknown function> + 0x13ae9c (0x55c7db187e9c in bin/test_multidevice)
frame #18: <unknown function> + 0x29d90 (0x7f7a9f1e0d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #19: __libc_start_main + 0x80 (0x7f7a9f1e0e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #20: <unknown function> + 0x140015 (0x55c7db18d015 in bin/test_multidevice)

--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node viking-prod-229 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

@samnordmann

Interesting! What is your docker run command?

@wujingyue

Also, when I ran all tests matching Bcast_sharded/PipelineTestTwoStages* instead of just Bcast_sharded/PipelineTestTwoStages.Communication/20, I got a hang:

$ _bn && mpirun -np 4 bin/test_multidevice --gtest_filter=Bcast_sharded/PipelineTestTwoStages*

Note: Google Test filter = Bcast_sharded/PipelineTestTwoStages*
[==========] Running 32 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 32 tests from Bcast_sharded/PipelineTestTwoStages
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/0
Note: Google Test filter = Bcast_sharded/PipelineTestTwoStages*
[==========] Running 32 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 32 tests from Bcast_sharded/PipelineTestTwoStages
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/0
Note: Google Test filter = Bcast_sharded/PipelineTestTwoStages*
[==========] Running 32 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 32 tests from Bcast_sharded/PipelineTestTwoStages
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/0
Note: Google Test filter = Bcast_sharded/PipelineTestTwoStages*
[==========] Running 32 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 32 tests from Bcast_sharded/PipelineTestTwoStages
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/0
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/0 (15349 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/1
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/0 (15353 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/1
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/0 (15344 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/1
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/0 (15349 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/1
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/1 (164 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/2
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/1 (164 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/2
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/1 (164 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/2
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/1 (164 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/2
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/2 (165 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/3
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/2 (165 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/3
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/2 (165 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/3
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/2 (165 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/3
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/3 (159 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/4
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/3 (159 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/4
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/3 (159 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/4
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/3 (159 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/4
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/4 (2285 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/5
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/4 (2285 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/5
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/4 (2285 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/5
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/4 (2285 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/5
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/5 (204 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/6
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/5 (204 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/6
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/5 (205 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/6
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/5 (205 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/6
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/6 (203 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/7
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/6 (203 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/7
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/6 (203 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/7
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/6 (203 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/7
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/7 (203 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/8
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/7 (203 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/8
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/7 (203 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/8
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/7 (203 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/8
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/8 (2183 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/9
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/8 (2183 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/9
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/8 (2183 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/9
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/8 (2183 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/9
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/9 (205 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/10
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/9 (206 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/10
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/9 (206 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/10
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/9 (206 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/10
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/10 (202 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/11
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/10 (202 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/11
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/10 (202 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/11
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/10 (202 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/11
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/11 (213 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/12
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/11 (213 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/12
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/11 (213 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/12
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/11 (213 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/12
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/12 (150 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/13
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/12 (150 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/13
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/12 (150 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/13
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/12 (150 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/13
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/13 (151 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/14
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/13 (151 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/14
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/13 (151 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/14
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/13 (151 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/14
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/14 (160 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/15
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/14 (160 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/15
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/14 (160 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/15
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/14 (160 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/15
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/15 (158 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/16
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/15 (158 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/16
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/15 (158 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/16
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/15 (158 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/16
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/16 (2548 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/17
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/16 (2548 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/17
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/16 (2548 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/17
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/16 (2549 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/17
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/17 (159 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/18
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/17 (159 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/18
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/17 (158 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/18
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/17 (159 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/18
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/18 (165 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/19
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/18 (165 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/19
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/18 (165 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/19
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/18 (165 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/19
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/19 (158 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/20
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/19 (158 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/20
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/19 (158 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/20
[       OK ] Bcast_sharded/PipelineTestTwoStages.Communication/19 (158 ms)
[ RUN      ] Bcast_sharded/PipelineTestTwoStages.Communication/20

@wujingyue

> Interesting! What is your docker run command?

I'll get back to you. There are some personal touches (such as apt-installing extra tools) in my Dockerfile. I don't think they're related, but I'll double-check with a clean Docker build.

@wujingyue

FWIW, https://gitlab-master.nvidia.com/jingyuew/pjnl contains my Dockerfile, the build script, and the run command.

@wujingyue

@samnordmann here's a repro with a clean image.

jingyuew@viking-prod-231:~$ docker images | grep pjnl-latest
gitlab-master.nvidia.com/dl/pytorch/update-scripts          pjnl-latest                                                   da1aefd7528f   2 days ago      16.8GB

jingyuew@viking-prod-231:~$ docker run --rm --gpus=all --net=host --ipc=host --cap-add=SYS_ADMIN --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --shm-size=1g --ulimit memlock=-1 gitlab-master.nvidia.com/dl/pytorch/update-scripts:pjnl-latest sh -c 'git revert -c 90260eff23372029e58656ea614b8eaab211ac5e && _bn && mpirun -np 4 bin/test_multidevice --gtest_filter=Bcast_sharded/PipelineTestTwoStages.Communication/20'

Same segfault as #3124 (comment).

@samnordmann

Thank you. I'm indeed able to reproduce it, even with UCX's fix openucx/ucx#10195 merged. My guess is that the original gdr_copy bug we saw in CI (and that I could reproduce) is gone, but we are now hitting a different segfault.

I need to investigate and probably escalate to UCX.
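
For reference while investigating, the standard UCX/UCC log knobs can be turned up on the same repro; whether they surface anything useful for this particular crash is only an assumption:

$ UCX_LOG_LEVEL=debug UCC_LOG_LEVEL=debug mpirun -np 4 bin/test_multidevice --gtest_filter=Bcast_sharded/PipelineTestTwoStages.Communication/20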

@samnordmann

samnordmann commented Oct 15, 2024

Regardless of my last message, I realized that the CI is running an older stable version of UCX, not the nightly master build. IIUC, it uses whatever the stable PyTorch container provides.
On the other hand, pjnl-latest, which I'm using, has the nightly UCX installed.

I understand that stable releases are preferable for CI. But because of that, in the present case we will keep seeing the bug in CI for as long as the UCX version doesn't change.

@wujingyue

cc @xwang233 to comment on the versions. I thought our CI (GitHub or nightly) uses pjnl-latest, so the versions should match. But apparently not?

@xwang233

We don't modify HPCX in the pjnl-latest image, which inherits its HPCX version from the internal upstream base image. Also, pjnl-latest is the image we use in CI.

The versions in GitHub (pjnl-latest) and nightly may mismatch by at most one day.

Can you point to a job log or a Docker image where you see an old HPCX version?

@samnordmann

samnordmann commented Oct 17, 2024

> We don't modify HPCX in the pjnl-latest image, which inherits its HPCX version from the internal upstream base image. Also, pjnl-latest is the image we use in CI.

Right, sorry, I probably got confused.

For the record, in this log I see that the following image is used:
gitlab-master.nvidia.com:5005/dl/pytorch/fuser-gh-mirror:nvfuser-gh-ci-19364891-cpp17

Then running docker run gitlab-master.nvidia.com:5005/dl/pytorch/fuser-gh-mirror:nvfuser-gh-ci-19364891-cpp17 ucx_info -v gives:

# Library version: 1.17.0
# Library path: /opt/hpcx/ucx/lib/libucs.so.0
# API headers version: 1.17.0
# Git branch '', revision 39c8f9b
# Configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --enable-mt --without-knem --with-xpmem=/hpc/local/oss/xpmem/v2.7.1 --without-java --enable-devel-headers --with-fuse3-static --with-cuda=/hpc/local/oss/cuda12.5.1/ubuntu22.04 --with-gdrcopy --prefix=/build-result/hpcx-v2.20-gcc-inbox-ubuntu22.04-cuda12-x86_64/ucx/mt --with-bfd=/hpc/local/oss/binutils/2.37/ubuntu22.04

which points to commit 39c8f9b, the head of the v1.17.x tag, which dates back to last July.

pjnl-latest points to the same commit.

So either we wait for the next HPCX release to pick up a change, or we move to nightly HPCX builds.
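
For what it's worth, a rough sketch of what a newer-than-release UCX would look like as a source build inside the image (the install prefix and configure flags below are illustrative, not what HPCX actually ships):

$ git clone https://github.com/openucx/ucx.git && cd ucx
$ ./autogen.sh
$ ./contrib/configure-release --prefix=/usr/local/ucx --enable-mt --with-cuda=/usr/local/cuda --with-gdrcopy
$ make -j"$(nproc)" && make install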

@xwang233

> So either we wait for the next HPCX release to pick up a change, or we move to nightly HPCX builds.

Will follow up offline on the HPCX version update in our base image.
