segfault when trying to open significantly too many contexts #10370

joshfisher-cornelisnetworks · 2022-05-11T20:33:37Z


Background information

Found this issue on a 2-HFI system with one HFI disabled, which caused the command to try to open far too many contexts for a single HFI.

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

v4.1.2

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

git clone

If you are building/installing from a git clone, please copy-n-paste the output from `git submodule status`.

No output

Please describe the system on which you are running

Operating system/version: RHEL 7.9
Computer hardware: x86_64
Network type: Back-to-back pair

Details of the problem

On a 2-HFI system with one HFI disabled, I ran a test that would normally work with both HFIs enabled. I expected a failure due to too many contexts, but got a segfault instead of a more graceful abort. When running with np and ppr values closer to the limit, but still over it, the abort is more graceful.

Command run:
openmpi-v4.1.2/bin/mpirun -np 192 --map-by ppr:96:node -host hostA:96,hostB:96 --bind-to core --display-map --tag-output --allow-run-as-root --mca mtl ofi --mca btl ofi -x LD_LIBRARY_PATH=path/to/opx/build -x FI_PROVIDER=opx FI_LOG_LEVEL=warn -x IMB-MPI1 -include Uniband,Biband -npmin 192 -iter 10000 -msglog 0:15

Backtrace found:

#0 0x00007fd9e35eb6e8 in mca_btl_ofi_context_finalize () from /mnt/ci/mpi/gcc/openmpi-v4.1.2/lib/openmpi/mca_btl_ofi.so
#1 0x00007fd9e35ebab9 in mca_btl_ofi_context_alloc_scalable () from /mnt/ci/mpi/gcc/openmpi-v4.1.2/lib/openmpi/mca_btl_ofi.so
#2 0x00007fd9e35e7f9f in mca_btl_ofi_component_init () from /mnt/ci/mpi/gcc/openmpi-v4.1.2/lib/openmpi/mca_btl_ofi.so
#3 0x00007fd9f3a74d16 in mca_btl_base_select () from /mnt/ci/mpi/gcc/openmpi-v4.1.2/lib/libopen-pal.so.40
#4 0x00007fd9e37f2441 in mca_bml_r2_component_init () from /mnt/ci/mpi/gcc/openmpi-v4.1.2/lib/openmpi/mca_bml_r2.so
#5 0x00007fd9f4e5f3ce in mca_bml_base_init () from /mnt/ci/mpi/gcc/openmpi-v4.1.2/lib/libmpi.so.40
#6 0x00007fd9f4e9d4fd in ompi_mpi_init () from /mnt/ci/mpi/gcc/openmpi-v4.1.2/lib/libmpi.so.40
#7 0x00007fd9f4e46875 in PMPI_Init_thread () from /mnt/ci/mpi/gcc/openmpi-v4.1.2/lib/libmpi.so.40
#8 0x0000000000405265 in main (argc=9, argv=0x7ffdfc1a2938) at imb.cpp:295
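
The backtrace shows `mca_btl_ofi_context_finalize` being called from `mca_btl_ofi_context_alloc_scalable`, which suggests the crash happens on an error-cleanup path during allocation. The thread does not state the root cause, so the following is only a minimal sketch of that general class of problem and one defensive way to handle it. It is not the Open MPI code: every `demo_*` name, the struct layout, and the `limit` parameter are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a per-context resource bundle. */
typedef struct {
    int *tx; /* placeholder for a transmit-side resource */
    int *rx; /* placeholder for a receive-side resource */
} demo_context_t;

/* Hypothetical: pretend the device can only back `limit` contexts. */
static int demo_open_resource(int **out, int index, int limit)
{
    assert(out != NULL);
    if (index >= limit) {
        return -1; /* device is out of contexts */
    }
    *out = malloc(sizeof **out);
    if (*out == NULL) {
        return -1;
    }
    **out = index;
    return 0;
}

/* Finalize that tolerates a NULL context and members that were never opened. */
void demo_context_finalize(demo_context_t *ctx)
{
    if (ctx == NULL) {
        return;
    }
    free(ctx->tx); /* free(NULL) is a no-op */
    ctx->tx = NULL;
    free(ctx->rx);
    ctx->rx = NULL;
}

/* Release an array returned by demo_context_alloc(). */
void demo_context_free(demo_context_t *ctxs, int count)
{
    if (ctxs == NULL) {
        return;
    }
    for (int i = 0; i < count; i++) {
        demo_context_finalize(&ctxs[i]);
    }
    free(ctxs);
}

/* Allocate `wanted` contexts; on failure, clean up and return NULL. */
demo_context_t *demo_context_alloc(int wanted, int limit)
{
    if (wanted <= 0) {
        return NULL;
    }
    /* calloc zeroes every member, so cleanup can tell "never opened" apart. */
    demo_context_t *ctxs = calloc((size_t)wanted, sizeof *ctxs);
    if (ctxs == NULL) {
        return NULL;
    }
    for (int i = 0; i < wanted; i++) {
        if (demo_open_resource(&ctxs[i].tx, i, limit) != 0 ||
            demo_open_resource(&ctxs[i].rx, i, limit) != 0) {
            /* Only contexts 0..i were touched; the rest are still zeroed. */
            demo_context_free(ctxs, i + 1);
            return NULL;
        }
    }
    return ctxs;
}
```

With these hypothetical helpers, asking for more contexts than the pretend limit, for example `demo_context_alloc(192, 96)`, returns `NULL` so the caller can report the failure and abort cleanly, rather than crashing while tearing down state that was only partially initialized.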

jsquyres · 2022-06-06T20:33:46Z

@joshfisher-cornelisnetworks Did you intend for this to be a self-assigned issue? I.e., if it's an HFI issue, that's a Cornelis issue, which is you, right?

mwheinz · 2022-06-06T20:36:46Z

Josh, did you mean to open a Jira?

joshfisher-cornelisnetworks · 2022-06-06T20:42:58Z

No, the fault consistently occurs in the OMPI part of the code, and everything we have been able to find points to this being an OMPI issue. We expect a failure, but at some point OMPI segfaults instead of failing more gracefully. Let me double-check with the people I have been working with on this issue, but when we last talked, we decided it was worthy of an OMPI bug.

mwheinz · 2022-06-06T20:49:32Z

So, as a point of history, the OFI BTL was originally written by Intel as part of the OmniPath project. If the failure is in the OFI BTL, it might be Cornelis’ responsibility now. It’s not clear. You might want to reach out to Sean Hefty; check OFIWG/libfabric for how to contact him.

jsquyres · 2022-06-06T21:08:47Z

Sean Hefty won't have much of a clue on Open MPI code -- he's more the libfabric guy than the Open MPI guy. Regardless, unless there's a non-HFI reproducer, I don't know if anyone else in the Open MPI community can work on this, because no one else will have HFI hardware.

joshfisher-cornelisnetworks · 2022-06-09T21:42:01Z

I have a pull request, targeted for v4.1.x, for the issue that caused this segfault.

jsquyres added the Target: v4.1.x label Jun 6, 2022

jsquyres added this to the v4.1.5 milestone Jun 6, 2022

jsquyres assigned joshfisher-cornelisnetworks Jun 6, 2022

jsquyres mentioned this issue Jun 9, 2022

v4.1.x: opal Segfault avoidence in class and mca/btl/ofi #10461

Closed

This was referenced Jun 10, 2022

opal: Segfault avoidence in class and mca/btl/ofi #10466

Closed

opal: Segfault avoidance in mca/btl/ofi #10467

Merged

bwbarrett modified the milestones: v4.1.5, v4.1.6 Feb 23, 2023

bwbarrett modified the milestones: v4.1.6, v4.1.7 Sep 30, 2023
