Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document openmpi symbol loading code #3812

Closed
fweik opened this issue Jul 24, 2020 · 3 comments · Fixed by #3823
Closed

Document openmpi symbol loading code #3812

fweik opened this issue Jul 24, 2020 · 3 comments · Fixed by #3823

Comments

@fweik
Copy link
Contributor

fweik commented Jul 24, 2020

The function mpi_init in communication.cpp contains the following code:

#ifdef OPEN_MPI
  openmpi_fix_vader();

  void *handle = nullptr;
  int mode = RTLD_NOW | RTLD_GLOBAL;
#ifdef RTLD_NOLOAD
  mode |= RTLD_NOLOAD;
#endif
  void *_openmpi_symbol = dlsym(RTLD_DEFAULT, "MPI_Init");
  if (!_openmpi_symbol) {
    fprintf(stderr, "%d: Aborting because unable to find OpenMPI symbol.\n",
            this_node);
    errexit();
  }
  Dl_info _openmpi_info;
  dladdr(_openmpi_symbol, &_openmpi_info);

  if (!handle)
    handle = dlopen(_openmpi_info.dli_fname, mode);

  if (!handle) {
    fprintf(stderr,
            "%d: Aborting because unable to load libmpi into the "
            "global symbol space.\n",
            this_node);
    errexit();
  }
#endif

It needs to be documented what this is doing and why it is needed. Also it is not clear to me that
this is safe in all circumstances, e.g. what happens if the shared has already be loaded before.

@mkuron
Copy link
Member

mkuron commented Jul 24, 2020

This was originally inspired by mpi4py (https://github.com/mpi4py/mpi4py/blob/4e3f47b6691c8f5a038e73f84b8d43b03f16627f/src/lib-mpi/compat/openmpi.h). It's needed because OpenMPI dlopens its submodules. These are unable to find the top-level OpenMPI library if that was dlopened itself, i.e. when the Python interpreter dlopens a module that is linked against OpenMPI. It's about some weird two-level symbol namespace thing. None of this is documented by OpenMPI and it certainly is a terrible hack. However, the only alternative would be to drop OpenMPI support and require people to use MPICH.

@fweik
Copy link
Contributor Author

fweik commented Jul 24, 2020

Thank you, I'll add some comments...

@kodiakhq kodiakhq bot closed this as completed in #3823 Jul 27, 2020
kodiakhq bot added a commit that referenced this issue Jul 27, 2020
Fixes #3812.

Description of changes:
- Cleanup the mpi init code
- GIve the python interface shared ownership of the mpi env
@mkuron
Copy link
Member

mkuron commented Aug 4, 2020

According to open-mpi/ompi#3705, this issue was fixed in OpenMPI 3.0.0 and higher. We should thus be able to limit our workaround to older versions.

kodiakhq bot added a commit that referenced this issue Aug 4, 2020
It is not needed anymore since open-mpi/ompi#3705. Related: #3812, where we document this trick.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants