Describe the bug

A call to MPI_Init gives this error message:

Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
This worked in UCX v1.11.2 and fails in UCX v1.12.0. Here is a backtrace:
Thread 1 "hello_world" received signal SIGSEGV, Segmentation fault. 0x00007fffe6bd9d5f in uct_base_iface_t_init (self=0x7eab60, _myclass=0x7fffe6e12d20 <uct_base_iface_t_class>, _init_count=0x7fffffffd778, ops=0x7fffe6523be0 <uct_rocm_ipc_iface_ops>, internal_ops=0x0, md=0x7fffe6523ad0 <md>, worker=0x7e5ba0, params=0x7fffffffd930, config=0x7e9cd0) at ../../../src/uct/base/uct_iface.c:511 511 ucs_assert(internal_ops->iface_estimate_perf != NULL); #0 0x00007fffe6bd9d5f in uct_base_iface_t_init (self=0x7eab60, _myclass=0x7fffe6e12d20 <uct_base_iface_t_class>, _init_count=0x7fffffffd778, ops=0x7fffe6523be0 <uct_rocm_ipc_iface_ops>, internal_ops=0x0, md=0x7fffe6523ad0 <md>, worker=0x7e5ba0, params=0x7fffffffd930, config=0x7e9cd0) at ../../../src/uct/base/uct_iface.c:511 #1 0x00007fffe631dd0d in uct_rocm_ipc_iface_t_init (self=0x7eab60, _myclass=0x7fffe6523dc0 <uct_rocm_ipc_iface_t_class>, _init_count=0x7fffffffd778, md=0x7fffe6523ad0 <md>, worker=0x7e5ba0, params=0x7fffffffd930, tl_config=0x7e9cd0) at ../../../../src/uct/rocm/ipc/rocm_ipc_iface.c:228 #2 0x00007fffe631dec2 in uct_rocm_ipc_iface_t_new (arg0=0x7fffe6523ad0 <md>, arg1=0x7e5ba0, arg2=0x7fffffffd930, arg3=0x7e9cd0, obj_p=0x7e9a60) at ../../../../src/uct/rocm/ipc/rocm_ipc_iface.c:262 #3 0x00007fffe6bd5df8 in uct_iface_open (md=0x7fffe6523ad0 <md>, worker=0x7e5ba0, params=0x7fffffffd930, config=0x7e9cd0, iface_p=0x7e9a60) at ../../../src/uct/base/uct_md.c:267 #4 0x00007fffe6e72456 in ucp_worker_iface_open (worker=0x7fffe4095010, tl_id=6 '\006', iface_params=0x7fffffffd930, wiface_p=0x7e5a80) at ../../../src/ucp/core/ucp_worker.c:1173 #5 0x00007fffe6e70853 in ucp_worker_add_resource_ifaces (worker=0x7fffe4095010) at ../../../src/ucp/core/ucp_worker.c:974 #6 0x00007fffe6e75d6c in ucp_worker_create (context=0x765a40, params=0x7fffffffdee0, worker_p=0x7fffe75e9340 <ompi_pml_ucx+192>) at ../../../src/ucp/core/ucp_worker.c:2210 #7 0x00007fffe73e0a8f in mca_pml_ucx_init (enable_mpi_threads=0) at ../../../../../ompi/mca/pml/ucx/pml_ucx.c:306 #8 0x00007fffe73e5c59 in mca_pml_ucx_component_init (priority=0x7fffffffe05c, enable_progress_threads=false, enable_mpi_threads=false) at ../../../../../ompi/mca/pml/ucx/pml_ucx_component.c:118 #9 0x00007ffff7b590d4 in mca_pml_base_select (enable_progress_threads=false, enable_mpi_threads=false) at ../../../../ompi/mca/pml/base/pml_base_select.c:127 #10 0x00007ffff7b6d6e9 in ompi_mpi_init (argc=1, argv=0x7fffffffe328, requested=0, provided=0x7fffffffe1ec, reinit_ok=false) at ../../ompi/runtime/ompi_mpi_init.c:646 #11 0x00007ffff7aeb2e9 in PMPI_Init (argc=0x7fffffffe21c, argv=0x7fffffffe210) at pinit.c:67 #12 0x0000000000400709 in main (argc=1, argv=0x7fffffffe328) at hello_world.c:5
$ cat hello_world.c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int world_size;
    int world_rank;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    printf("Hello world from rank %d out of %d processors\n", world_rank, world_size);

    MPI_Finalize();
}
$ mpicc -ggdb -O0 hello_world.c -o hello_world
$ mpirun -np 1 ./hello_world
[HOSTNAME:76168:0:76168] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid:  76168) ====
 0 /path/to/lib64/libucs.so.0(ucs_handle_error+0x73) [0x7f545c0ea0af]
 1 /path/to/lib64/libucs.so.0(+0x32e88) [0x7f545c0e9e88]
 2 /path/to/lib64/libucs.so.0(+0x32fcf) [0x7f545c0e9fcf]
 3 /path/to/lib64/libuct.so.0(uct_base_iface_t_init+0xf4) [0x7f545c33dd5f]
 4 /path/to/lib64/ucx/libuct_rocm.so.0(+0x7d0d) [0x7f54579cad0d]
 5 /path/to/lib64/ucx/libuct_rocm.so.0(+0x7ec2) [0x7f54579caec2]
 6 /path/to/lib64/libuct.so.0(uct_iface_open+0x18f) [0x7f545c339df8]
 7 /path/to/lib64/libucp.so.0(ucp_worker_iface_open+0x49f) [0x7f545c5d6456]
 8 /path/to/lib64/libucp.so.0(+0x5a853) [0x7f545c5d4853]
 9 /path/to/lib64/libucp.so.0(ucp_worker_create+0x6b2) [0x7f545c5d9d6c]
10 /path/to/lib64/openmpi/mca_pml_ucx.so(mca_pml_ucx_init+0xcb) [0x7f545cb44a8f]
11 /path/to/lib64/openmpi/mca_pml_ucx.so(+0x9c59) [0x7f545cb49c59]
12 /path/to/lib64/libmpi.so.40(mca_pml_base_select+0x272) [0x7f546910b0d4]
13 /path/to/lib64/libmpi.so.40(ompi_mpi_init+0x889) [0x7f546911f6e9]
14 /path/to/lib64/libmpi.so.40(MPI_Init+0x7f) [0x7f546909d2e9]
15 ./hello_world() [0x400709]
16 /lib64/libc.so.6(__libc_start_main+0xed) [0x7f5468a3634d]
17 ./hello_world() [0x40063a]
=================================
[HOSTNAME:76168] *** Process received signal ***
[HOSTNAME:76168] Signal: Segmentation fault (11)
[HOSTNAME:76168] Signal code: (-6)
[HOSTNAME:76168] Failing at address: 0x3e800012988
[HOSTNAME:76168] [ 0] /lib64/libpthread.so.0(+0x13f80)[0x7f5468df9f80]
[HOSTNAME:76168] [ 1] /path/to/lib64/libuct.so.0(uct_base_iface_t_init+0xf4)[0x7f545c33dd5f]
[HOSTNAME:76168] [ 2] /path/to/lib64/ucx/libuct_rocm.so.0(+0x7d0d)[0x7f54579cad0d]
[HOSTNAME:76168] [ 3] /path/to/lib64/ucx/libuct_rocm.so.0(+0x7ec2)[0x7f54579caec2]
[HOSTNAME:76168] [ 4] /path/to/lib64/libuct.so.0(uct_iface_open+0x18f)[0x7f545c339df8]
[HOSTNAME:76168] [ 5] /path/to/lib64/libucp.so.0(ucp_worker_iface_open+0x49f)[0x7f545c5d6456]
[HOSTNAME:76168] [ 6] /path/to/lib64/libucp.so.0(+0x5a853)[0x7f545c5d4853]
[HOSTNAME:76168] [ 7] /path/to/lib64/libucp.so.0(ucp_worker_create+0x6b2)[0x7f545c5d9d6c]
[HOSTNAME:76168] [ 8] /path/to/lib64/openmpi/mca_pml_ucx.so(mca_pml_ucx_init+0xcb)[0x7f545cb44a8f]
[HOSTNAME:76168] [ 9] /path/to/lib64/openmpi/mca_pml_ucx.so(+0x9c59)[0x7f545cb49c59]
[HOSTNAME:76168] [10] /path/to/lib64/libmpi.so.40(mca_pml_base_select+0x272)[0x7f546910b0d4]
[HOSTNAME:76168] [11] /path/to/lib64/libmpi.so.40(ompi_mpi_init+0x889)[0x7f546911f6e9]
[HOSTNAME:76168] [12] /path/to/lib64/libmpi.so.40(MPI_Init+0x7f)[0x7f546909d2e9]
[HOSTNAME:76168] [13] ./hello_world[0x400709]
[HOSTNAME:76168] [14] /lib64/libc.so.6(__libc_start_main+0xed)[0x7f5468a3634d]
[HOSTNAME:76168] [15] ./hello_world[0x40063a]
[HOSTNAME:76168] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node HOSTNAME exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Steps to Reproduce

UCX version (ucx_info -v): v1.12.0, built from the release tarball like so:
$ ../configure CFLAGS="-O0 -ggdb" CXXFLAGS="-O0 -ggdb" --prefix=/path/to --with-rocm --without-knem --without-cuda --without-java
$ make -j `nproc`
$ make install
Environment variables used:

export OMPI_MCA_pml=ucx
export OMPI_MCA_btl=^openib,tcp
Setup and versions

OS and kernel (cat /etc/issue, uname -a):
$ cat /etc/issue
Welcome to SUSE Linux Enterprise Server 15 SP3 (x86_64) - Kernel \r (\l).
eth0: \4{eth0} \6{eth0}

$ uname -a
Linux HOSTNAME 5.3.18-57_11.0.18-cray_shasta_c #1 SMP Sun Jul 18 18:14:52 UTC 2021 (15c194a) x86_64 x86_64 x86_64 GNU/Linux
GPUs are AMD Instinct MI250X.
$ /usr/sbin/dkms status
amdgpu, 5.13.11.21.50-1384496, 5.3.18-57_11.0.18-cray_shasta_c, x86_64: installed
Additional information

OpenMPI v4.1.1 tarball built like so:
$ ../configure CFLAGS="-O0 -ggdb" CXXFLAGS="-O0 -ggdb" --prefix=/path/to --with-ucx=/path/to --without-verbs
$ make `nproc`
$ make install
Interestingly enough, I couldn't run ucx_info -d because that also segfaults. Here is a backtrace from ucx_info -d:
# Memory domain: rocm_ipc
#     Component: rocm_ipc
#     register: unlimited, cost: 9 nsec
#     remote key: 56 bytes
#
#     Transport: rocm_ipc
#     Device: rocm_ipc
#     Type: accelerator
#     System device: <unknown>
[Thread 0x7fffeffff700 (LWP 76976) exited]

Thread 1 "ucx_info" received signal SIGSEGV, Segmentation fault.
0x00007ffff77d7d5f in uct_base_iface_t_init (self=0x6615c0, _myclass=0x7ffff7a10d20 <uct_base_iface_t_class>, _init_count=0x7fffffffd838, ops=0x7ffff63dfbe0 <uct_rocm_ipc_iface_ops>, internal_ops=0x0, md=0x7ffff63dfad0 <md>, worker=0x6609d0, params=0x7fffffffdb30, config=0x631060) at ../../../src/uct/base/uct_iface.c:511
511         ucs_assert(internal_ops->iface_estimate_perf != NULL);
#0  0x00007ffff77d7d5f in uct_base_iface_t_init (self=0x6615c0, _myclass=0x7ffff7a10d20 <uct_base_iface_t_class>, _init_count=0x7fffffffd838, ops=0x7ffff63dfbe0 <uct_rocm_ipc_iface_ops>, internal_ops=0x0, md=0x7ffff63dfad0 <md>, worker=0x6609d0, params=0x7fffffffdb30, config=0x631060) at ../../../src/uct/base/uct_iface.c:511
#1  0x00007ffff61d9d0d in uct_rocm_ipc_iface_t_init (self=0x6615c0, _myclass=0x7ffff63dfdc0 <uct_rocm_ipc_iface_t_class>, _init_count=0x7fffffffd838, md=0x7ffff63dfad0 <md>, worker=0x6609d0, params=0x7fffffffdb30, tl_config=0x631060) at ../../../../src/uct/rocm/ipc/rocm_ipc_iface.c:228
#2  0x00007ffff61d9ec2 in uct_rocm_ipc_iface_t_new (arg0=0x7ffff63dfad0 <md>, arg1=0x6609d0, arg2=0x7fffffffdb30, arg3=0x631060, obj_p=0x7fffffffd8e8) at ../../../../src/uct/rocm/ipc/rocm_ipc_iface.c:262
#3  0x00007ffff77d3df8 in uct_iface_open (md=0x7ffff63dfad0 <md>, worker=0x6609d0, params=0x7fffffffdb30, config=0x631060, iface_p=0x7fffffffd8e8) at ../../../src/uct/base/uct_md.c:267
#4  0x000000000040443b in print_iface_info (worker=0x6609d0, md=0x7ffff63dfad0 <md>, resource=0x65dc00) at ../../../../src/tools/info/tl_info.c:156
#5  0x00000000004054d9 in print_tl_info (md=0x7ffff63dfad0 <md>, tl_name=0x65dc00 "rocm_ipc", resources=0x65dc00, num_resources=1, print_opts=16, print_flags=0) at ../../../../src/tools/info/tl_info.c:375
#6  0x0000000000405a4c in print_md_info (component=0x7ffff63dfa00 <uct_rocm_ipc_component>, component_attr=0x7fffffffe0a0, md_name=0x7fffffffe060 "rocm_ipc", print_opts=16, print_flags=0, req_tl_name=0x0) at ../../../../src/tools/info/tl_info.c:482
#7  0x0000000000405d20 in print_uct_component_info (component=0x7ffff63dfa00 <uct_rocm_ipc_component>, print_opts=16, print_flags=0, req_tl_name=0x0) at ../../../../src/tools/info/tl_info.c:588
#8  0x0000000000405dbd in print_uct_info (print_opts=16, print_flags=0, req_tl_name=0x0) at ../../../../src/tools/info/tl_info.c:614
#9  0x00000000004068ed in main (argc=2, argv=0x7fffffffe2f8) at ../../../../src/tools/info/ucx_info.c:257
With UCX 1.11.2, ucx_info -d runs to completion, and here is the output from the same place it failed in v1.12.0:
# Memory domain: rocm_ipc
#     Component: rocm_ipc
#     register: unlimited, cost: 9 nsec
#     remote key: 56 bytes
#
#     Transport: rocm_ipc
#     Device: rocm_ipc
#     System device: <unknown>
#
#     capabilities:
#     bandwidth: 409600.00/ppn + 0.00 MB/sec
#     latency: 1 nsec
#     overhead: 0 nsec
#     put_zcopy: unlimited, up to 1 iov
#     put_opt_zcopy_align: <= 4
#     put_align_mtu: <= 4
#     get_zcopy: unlimited, up to 1 iov
#     get_opt_zcopy_align: <= 4
#     get_align_mtu: <= 4
#     connection: to iface
#     device priority: 0
#     device num paths: 1
#     max eps: inf
#     device address: 8 bytes
#     iface address: 4 bytes
#     error handling: none
#
#
# Memory domain: cma
#     Component: cma
#     register: unlimited, cost: 9 nsec
#
#     Transport: cma
#     Device: memory
#     System device: <unknown>
#
#     capabilities:
#     bandwidth: 0.00/ppn + 11145.00 MB/sec
#     latency: 80 nsec
#     overhead: 400 nsec
#     put_zcopy: unlimited, up to 16 iov
#     put_opt_zcopy_align: <= 1
#     put_align_mtu: <= 1
#     get_zcopy: unlimited, up to 16 iov
#     get_opt_zcopy_align: <= 1
#     get_align_mtu: <= 1
#     connection: to iface
#     device priority: 0
#     device num paths: 1
#     max eps: inf
#     device address: 8 bytes
#     iface address: 4 bytes
#     error handling: peer failure, ep_check
#
I think we fixed it with 1.12.1. Can you pls try with https://github.com/openucx/ucx/releases/tag/v1.12.1-rc4? cc @edgargabriel
Confirmed. UCX v1.11.2 works, UCX v1.12.0 segfaults, and UCX v1.12.1 works.
Thanks!