Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hang in dcx/test_ucp_peer_failure_2pairs.status_after_error/0 #2537

Open
shamisp opened this issue Apr 19, 2018 · 6 comments
Open

Hang in dcx/test_ucp_peer_failure_2pairs.status_after_error/0 #2537

shamisp opened this issue Apr 19, 2018 · 6 comments
Assignees
Labels

Comments

@shamisp
Copy link
Contributor

shamisp commented Apr 19, 2018

$ make -C test/gtest/ GTEST_FILTER="*test_ucp_peer_failure_2pairs.status_after_error*" test

RHEL7.5, OFED-internal-4.3-1.0.1, UCX master

dc/test_ucp_peer_failure_2pairs.status_after_error/0 test runs fine.

@shamisp
Copy link
Contributor Author

shamisp commented Apr 19, 2018

@gmegan - FYI

@yosefe yosefe added the Bug label Apr 22, 2018
@yosefe yosefe assigned evgeny-leksikov and unassigned yosefe Apr 22, 2018
@yosefe
Copy link
Contributor

yosefe commented May 2, 2018

we saw this happen because of timeout on 2nd pair, so smoke_test wait-for-recv does not complete.
@shamisp do you see smoke_test() in the backtrace, if attaching with gdb?

@shamisp
Copy link
Contributor Author

shamisp commented May 2, 2018

@gmegan can you please double check ?

@gmegan
Copy link

gmegan commented May 3, 2018

Smoke test appears in the backtrace. For this test, it hung on udx/test_ucp_peer_failure_2pairs.status_after_error/0


gtest --gtest_filter=test_ucp_peer_failure_2pairs.status_after_error
[ INFO ] checking dc: no
[ INFO ] checking dcx: no
[ INFO ] checking ud: yes
[ INFO ] checking udx: yes
[ INFO ] checking rc: yes
[ INFO ] checking rcx: yes
[ INFO ] checking shm_ib: yes
[ INFO ] checking ugni: no
[ INFO ] checking self: yes
[ INFO ] checking tcp: yes
[1525358073.161028] [vulcan2:36017:0] ucp_context.c:557 UCX WARN transport '\dc' is not available
[ INFO ] checking all_rcdc: yes
[ INFO ] checking all: yes
[ INFO ] checking rc_x: yes
[ INFO ] checking ud_mlx5: yes
[ INFO ] checking shm: yes
[1525358073.297950] [vulcan2:36017:0] ucp_context.c:557 UCX WARN transport 'rdmacm' is not available
[ INFO ] checking mm_rdmacm: yes
[ INFO ] Using random seed of 7673
Note: Google Test filter = test_ucp_peer_failure_2pairs.status_after_error
[==========] Running 7 tests from 7 test cases.
[----------] Global test environment set-up.
[----------] 1 test from ud/test_ucp_peer_failure_2pairs
[ RUN ] ud/test_ucp_peer_failure_2pairs.status_after_error/0
[ OK ] ud/test_ucp_peer_failure_2pairs.status_after_error/0 (1138 ms)
[----------] 1 test from ud/test_ucp_peer_failure_2pairs (1138 ms total)

[----------] 1 test from udx/test_ucp_peer_failure_2pairs
[ RUN ] udx/test_ucp_peer_failure_2pairs.status_after_error/0


(gdb) backtrace
#0 0x0000ffffb1722fb4 in uct_ib_mlx5_poll_cq (cq=0x65dbea8, iface=0x65d9780)
at /home/meggro01/local/src/maas-tools/shmem-setup/BUILD/ucx/BUILD/../src/uct/ib/mlx5/ib_mlx5.inl:31
#1 uct_ud_mlx5_iface_poll_rx (is_async=0, iface=0x65d9780) at ../../../src/uct/ib/ud/accel/ud_mlx5.c:389
#2 uct_ud_mlx5_iface_progress (tl_iface=0x65d9780) at ../../../src/uct/ib/ud/accel/ud_mlx5.c:460
#3 0x0000ffffb15fc0b4 in ucs_callbackq_dispatch (cbq=)
at /home/meggro01/local/src/maas-tools/shmem-setup/BUILD/ucx/BUILD/../src/ucs/datastruct/callbackq.h:208
#4 uct_worker_progress (worker=)
at /home/meggro01/local/src/maas-tools/shmem-setup/BUILD/ucx/BUILD/../src/uct/api/uct.h:1650
#5 ucp_worker_progress (worker=0x661d1a0) at ../../../src/ucp/core/ucp_worker.c:1356
#6 0x0000000000588c54 in progress (this=0x656d360) at ../../../test/gtest/ucp/test_ucp_peer_failure.cc:546
#7 test_ucp_peer_failure_2pairs::wait_req (this=this@entry=0x656d360, req=req@entry=0x65f4e20)
at ../../../test/gtest/ucp/test_ucp_peer_failure.cc:571
#8 0x0000000000589184 in wait_req (req=0x65f4e20, this=0x656d360) at ../../../test/gtest/ucp/test_ucp_peer_failure.cc:598
#9 test_ucp_peer_failure_2pairs::smoke_test (this=this@entry=0x656d360, idx=idx@entry=0)
at ../../../test/gtest/ucp/test_ucp_peer_failure.cc:597
#10 0x0000000000590530 in test_ucp_peer_failure_2pairs::init (this=0x656d360) at ../../../test/gtest/ucp/test_ucp_peer_failure.cc:510
#11 0x000000000047a8c8 in ucs::test_base::SetUpProxy (this=0x656d360) at ../../../test/gtest/common/test.cc:229
#12 0x000000000046e06c in HandleSehExceptionsInMethodIfSupported<testing::Test, void> (location=0x79f5b8 "SetUp()",
method=, object=0x656d3c0) at ../../../test/gtest/common/gtest-all.cc:3562
#13 testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void> (object=0x656d3c0, method=,
location=0x79f5b8 "SetUp()") at ../../../test/gtest/common/gtest-all.cc:3598
#14 0x00000000004662e4 in testing::Test::Run (this=0x656d3c0) at ../../../test/gtest/common/gtest-all.cc:3630
#15 0x00000000004666ac in Run (this=0x656d3c0) at ../../../test/gtest/common/gtest-all.cc:3626
#16 testing::TestInfo::Run (this=0x62cdd90) at ../../../test/gtest/common/gtest-all.cc:3812
#17 0x000000000046695c in Run (this=) at ../../../test/gtest/common/gtest-all.cc:3940
#18 testing::TestCase::Run (this=0x62cdf00) at ../../../test/gtest/common/gtest-all.cc:3930
#19 0x0000000000467700 in Run (this=) at ../../../test/gtest/common/gtest-all.cc:3916
#20 testing::internal::UnitTestImpl::RunAllTests (this=this@entry=0x61c2e30) at ../../../test/gtest/common/gtest-all.cc:5802
#21 0x00000000004679c4 in RunAllTests (this=0x61c2e30) at ../../../test/gtest/common/gtest-all.cc:5719
#22 HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (
location=0x79f720 "auxiliary test code (environments or event listeners)", method=, object=0x61c2e30)
at ../../../test/gtest/common/gtest-all.cc:3562
#23 HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (
location=0x79f720 "auxiliary test code (environments or event listeners)",
method=(bool (testing::internal::UnitTestImpl::*)(testing::internal::UnitTestImpl * const)) 0x467908 testing::internal::UnitTestImpl::RunAllTests(), object=0x61c2e30) at ../../../test/gtest/common/gtest-all.cc:3598
#24 testing::UnitTest::Run (this=) at ../../../test/gtest/common/gtest-all.cc:5413
#25 0x0000000000414c64 in RUN_ALL_TESTS () at ../../../test/gtest/common/gtest.h:20059
#26 main (argc=, argv=0xffffd1ee0128) at ../../../test/gtest/common/main.cc:85
(gdb)

@yosefe
Copy link
Contributor

yosefe commented May 6, 2018

@shamisp @gmegan can you pls check if it hangs on "udx/test_ucp_peer_failure_2pairs.status_after_error/0"?
i could reproduce and fix udx/test_ucp_peer_failure_2pairs.status_after_error/0 only for now

@yosefe
Copy link
Contributor

yosefe commented Jun 26, 2019

Test hangs with valgrind:
http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/10817/label=hpc-test-node,worker=1/console

20:04:15 [ RUN      ] dcx/test_ucp_peer_failure.zcopy/0
20:19:15 /scrap/jenkins/workspace/hpc-ucx-pr-6/label/hpc-test-node/worker/1/contrib/../test/gtest/common/test_helpers.cc:46: Failure
20:19:15 Failed
20:19:15 Connection timed out - abort testing
20:19:15 [hpc-test-node:23748:0:23748] Caught signal 6 (Aborted: tkill(2) or tgkill(2))
20:21:40 ==== backtrace (tid:  23748) ====
20:21:40  0 0x00000000000dce45 __GI___sched_yield()  :0
20:21:40  1 0x000000000078665d ucp_test::progress()  /scrap/jenkins/workspace/hpc-ucx-pr-6/label/hpc-test-node/worker/1/contrib/../test/gtest/ucp/ucp_test.cc:148
20:21:40  2 0x000000000069fde2 test_ucp_peer_failure::do_test()  /scrap/jenkins/workspace/hpc-ucx-pr-6/label/hpc-test-node/worker/1/contrib/../test/gtest/ucp/test_ucp_peer_failure.cc:306
20:21:40  3 0x0000000000560066 ucs::test_base::run()  /scrap/jenkins/workspace/hpc-ucx-pr-6/label/hpc-test-node/worker/1/contrib/../test/gtest/common/test.cc:276
20:21:40  4 0x000000000054aba3 testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>()  /scrap/jenkins/workspace/hpc-ucx-pr-6/label/hpc-test-node/worker/1/contrib/../test/gtest/common/gtest-all.cc:3562
20:21:40  5 0x000000000053f01d testing::Test::Run()  /scrap/jenkins/workspace/hpc-ucx-pr-6/label/hpc-test-node/worker/1/contrib/../test/gtest/common/gtest-all.cc:3635
20:21:40  6 0x000000000053f0ec testing::TestInfo::Run()  /scrap/jenkins/workspace/hpc-ucx-pr-6/label/hpc-test-node/worker/1/contrib/../test/gtest/common/gtest-all.cc:3812
20:21:40  7 0x000000000053f24f testing::TestCase::Run()  /scrap/jenkins/workspace/hpc-ucx-pr-6/label/hpc-test-node/worker/1/contrib/../test/gtest/common/gtest-all.cc:3930
20:21:40  8 0x0000000000543be7 testing::internal::UnitTestImpl::RunAllTests()  /scrap/jenkins/workspace/hpc-ucx-pr-6/label/hpc-test-node/worker/1/contrib/../test/gtest/common/gtest-all.cc:5802
20:21:40  9 0x0000000000543eeb testing::internal::UnitTestImpl::RunAllTests()  /scrap/jenkins/workspace/hpc-ucx-pr-6/label/hpc-test-node/worker/1/contrib/../test/gtest/common/gtest-all.cc:5719
20:21:40 10 0x00000000004eb5b3 main()  /scrap/jenkins/workspace/hpc-ucx-pr-6/label/hpc-test-node/worker/1/contrib/../test/gtest/common/gtest.h:20059
20:21:40 11 0x0000000000021c05 __libc_start_main()  ???:0
20:21:40 12 0x00000000005297a4 _start()  ???:0
20:21:40 =================================
20:21:40 Sending notification to [email protected]
20:23:57 [hpc-test-node:23748:0:23748] Process frozen...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants