
segfault when trying to open significantly too many contexts #10370

Open
joshfisher-cornelisnetworks opened this issue May 11, 2022 · 6 comments

@joshfisher-cornelisnetworks
Contributor

joshfisher-cornelisnetworks commented May 11, 2022

Thank you for taking the time to submit an issue!

Background information

Found the issue on a 2-HFI system with one HFI disabled, which caused the command to open far too many contexts on the single remaining HFI.

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

v4.1.2

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

git clone

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

No output

Please describe the system on which you are running

  • Operating system/version: RHEL 7.9
  • Computer hardware: x86_64
  • Network type: Back-to-back pair

Details of the problem

Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.

Note: If you include verbatim output (or a code block), please use a GitHub Markdown code block like below:

shell$ mpirun -n 2 ./hello_world

On a 2-HFI system with one HFI disabled, I ran a test sized for both HFIs. I expected a failure due to too many contexts, but got a segfault instead of a more graceful abort. When running with np and ppr values closer to (but still over) the limit, the abort is more graceful.

Command run:
openmpi-v4.1.2/bin/mpirun -np 192 --map-by ppr:96:node -host hostA:96,hostB:96 --bind-to core --display-map --tag-output --allow-run-as-root --mca mtl ofi --mca btl ofi -x LD_LIBRARY_PATH=path/to/opx/build -x FI_PROVIDER=opx FI_LOG_LEVEL=warn -x IMB-MPI1 -include Uniband,Biband -npmin 192 -iter 10000 -msglog 0:15

Backtrace found:

#0 0x00007fd9e35eb6e8 in mca_btl_ofi_context_finalize () from /mnt/ci/mpi/gcc/openmpi-v4.1.2/lib/openmpi/mca_btl_ofi.so
#1 0x00007fd9e35ebab9 in mca_btl_ofi_context_alloc_scalable () from /mnt/ci/mpi/gcc/openmpi-v4.1.2/lib/openmpi/mca_btl_ofi.so
#2 0x00007fd9e35e7f9f in mca_btl_ofi_component_init () from /mnt/ci/mpi/gcc/openmpi-v4.1.2/lib/openmpi/mca_btl_ofi.so
#3 0x00007fd9f3a74d16 in mca_btl_base_select () from /mnt/ci/mpi/gcc/openmpi-v4.1.2/lib/libopen-pal.so.40
#4 0x00007fd9e37f2441 in mca_bml_r2_component_init () from /mnt/ci/mpi/gcc/openmpi-v4.1.2/lib/openmpi/mca_bml_r2.so
#5 0x00007fd9f4e5f3ce in mca_bml_base_init () from /mnt/ci/mpi/gcc/openmpi-v4.1.2/lib/libmpi.so.40
#6 0x00007fd9f4e9d4fd in ompi_mpi_init () from /mnt/ci/mpi/gcc/openmpi-v4.1.2/lib/libmpi.so.40
#7 0x00007fd9f4e46875 in PMPI_Init_thread () from /mnt/ci/mpi/gcc/openmpi-v4.1.2/lib/libmpi.so.40
#8 0x0000000000405265 in main (argc=9, argv=0x7ffdfc1a2938) at imb.cpp:295
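
The backtrace shows mca_btl_ofi_context_finalize being reached from inside mca_btl_ofi_context_alloc_scalable, i.e. cleanup is running on a context whose allocation did not complete. Below is a minimal standalone sketch of that pattern and of the defensive cleanup that would turn the crash into a clean failure. This is not Open MPI source; the struct and function names are hypothetical stand-ins used only to illustrate the suspected failure mode.

/* Sketch only: models cleanup after a failed context allocation. */
#include <stdio.h>
#include <stdlib.h>

struct ofi_context_sketch {
    void *tx_ctx;  /* stands in for the OFI endpoint handle */
    void *cq;      /* stands in for the completion queue handle */
};

/* Defensive finalize: tolerates a partially constructed context,
 * which is what the error path needs when the provider runs out
 * of contexts. Without these NULL checks, finalizing a half-built
 * context is where a segfault would occur. */
static void context_finalize(struct ofi_context_sketch *ctx)
{
    if (ctx == NULL) {
        return;
    }
    if (ctx->tx_ctx != NULL) {
        /* fi_close() on the endpoint would go here */
        ctx->tx_ctx = NULL;
    }
    if (ctx->cq != NULL) {
        /* fi_close() on the CQ would go here */
        ctx->cq = NULL;
    }
}

/* Allocation that can fail part-way, e.g. when one HFI has to
 * supply twice the usual number of contexts. */
static struct ofi_context_sketch *context_alloc(int provider_out_of_contexts)
{
    struct ofi_context_sketch *ctx = calloc(1, sizeof(*ctx));
    if (ctx == NULL) {
        return NULL;
    }
    if (provider_out_of_contexts) {
        /* Error path: clean up whatever exists and bail out gracefully. */
        context_finalize(ctx);
        free(ctx);
        return NULL;
    }
    /* ... normal tx_ctx / cq creation would happen here ... */
    return ctx;
}

int main(void)
{
    if (context_alloc(1) == NULL) {
        fprintf(stderr, "context allocation failed; aborting cleanly\n");
        return 1;
    }
    return 0;
}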
@jsquyres jsquyres added this to the v4.1.5 milestone Jun 6, 2022
@jsquyres
Member

jsquyres commented Jun 6, 2022

@joshfisher-cornelisnetworks Did you intend for this to be a self-assigned issue? I.e., if it's an HFI issue, that's a Cornelis issue, which is you, right?

@mwheinz

mwheinz commented Jun 6, 2022

Josh, did you mean to open a Jira?

@joshfisher-cornelisnetworks
Contributor Author

No, the fault is consistently happening in the OMPI part of the code, and everything we have been able to find points to this being an OMPI issue. We expect a failure, but at some point OMPI segfaults instead of aborting more gracefully. Let me double-check with the people I have been working with on this, but the last time we talked, we decided it was worthy of an OMPI bug.

@mwheinz

mwheinz commented Jun 6, 2022

So, as a point of history, the OFI BTL was originally written by Intel as part of the OmniPath project. If the failure is in the OFI BTL, it might be Cornelis' responsibility now. It's not clear. You might want to reach out to Sean Hefty; check OFIWG/libfabric for a way to contact him.

@jsquyres
Member

jsquyres commented Jun 6, 2022

Sean Hefty won't have much of a clue on Open MPI code -- he's more the libfabric guy than the Open MPI guy. Regardless, unless there's a non-HFI reproducer, I don't know if anyone else in the Open MPI community can work on this, because no one else will have HFI hardware.

@joshfisher-cornelisnetworks
Contributor Author

I have a pull request for the issue that caused this segfault, targeted for v4.1.x.
