
mpirun launch failure on heterogeneous system #6762

Closed
snyjm-18 opened this issue Jun 18, 2019 · 13 comments

@snyjm-18

Background information

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

v4.0.1

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

tarball
On x86:
./configure --prefix=/global/home/users/johns/opt/4.0.1 --with-slurm=no --with-ucx=/global/home/users/johns/opt/ucx --with-verbs=no --enable-heterogeneous --enable-debug --with-hwloc=internal
and on ARM:
./configure --prefix=/home/johns/opt/4.0.1 --with-slurm=no --with-ucx=/home/johns/opt/ucx --with-verbs=no --enable-heterogeneous --enable-debug --with-hwloc=internal
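As a sanity check (not part of the original report), the configure line each installation was actually built with can be compared across nodes; `ompi_info` records it at build time, so running this on an x86 node and an ARM node shows whether the two builds diverge:

```shell
# Print the configure command line recorded in the local Open MPI
# installation; compare the output across architectures. Falls back
# to a notice when ompi_info is not on the PATH.
if command -v ompi_info >/dev/null 2>&1; then
    ompi_info | grep "Configure command line"
else
    echo "ompi_info not found in PATH"
fi
```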

Please describe the system on which you are running

  • Operating system/version: CentOS Linux 7 (Core)/CentOS Linux 7 (AltArch)
  • Computer hardware: x86_64-unknown-linux-gnu/aarch64-unknown-linux-gnu
  • Network type: infiniband (mlx5)

Details of the problem

Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.

I would like to launch a job with a single mpirun across a heterogeneous system that has both ARM and x86 cores, with Open MPI installed in different locations on each architecture.

[johns@jupiter008 ~]$ mpirun -H jupiter008 hostname : --prefix /home/johns/opt/4.0.1 -H jupiter-bf09 /usr/bin/hostname
jupiter008.hpcadvisorycouncil.com
[jupiter-bf09:02375] [[9309,0],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file grpcomm_direct.c at line 355
--------------------------------------------------------------------------
An internal error has occurred in ORTE:

[[9309,0],1] FORCE-TERMINATE AT Data unpack would read past end of buffer:-26 - error grpcomm_direct.c(359)

This is something that should be reported to the developers.
--------------------------------------------------------------------------

I do not know if this is relevant, but I can launch jobs from an Intel node on the Arm nodes.

[johns@jupiter008 ~]$ mpirun --prefix /home/johns/opt/4.0.1 -H jupiter-bf08,jupiter-bf09 /usr/bin/hostname
jupiter-bf08
jupiter-bf09

Issue #4437 was similar and was fixed by using a homogeneous system. I have to use the hybrid system.

@snyjm-18
Author

snyjm-18 commented Jun 19, 2019

I was able to fix the issue. The x86 installation had found zlib and the ARM installation had not, so the x86 cores were compressing the data and the ARM cores were not decompressing it. I did not specify whether or not to use libz, but one build used it and the other did not. Both systems have libz installed in the same place. Is this a bug in the configure script, or is it just something that people need to be on the lookout for?
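One way to confirm this kind of mismatch between builds (a suggestion, not from the original thread) is to check what configure recorded in each architecture's build tree; the exact configure test names are version-dependent, so a broad grep is used here:

```shell
# Run in each architecture's Open MPI build directory to see whether
# configure detected zlib there; differing results between the x86 and
# ARM build trees reproduce the situation described above.
if [ -f config.log ]; then
    grep -i zlib config.log || echo "no zlib references found in config.log"
else
    echo "config.log not found; run this from the build directory"
fi
```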

@jsquyres
Member

Mmm... yes, I can see how Open MPI would assume that the configuration is the same everywhere and would therefore not detect that situation (ending up with a correct-but-confusing error message).

@rhc54 Is there an easy way around this?

@rhc54
Contributor

rhc54 commented Jun 20, 2019

Only solution I can think of would be to add a flag to the "phone home" msg from each daemon indicating if compression is supported or not, and then turn it off if any daemon doesn't support it. Would also have to include a flag in the launch msg (which is the only one that gets compressed) indicating if compression was or wasn't used.

Perhaps easier would be to just automatically turn off compression if someone indicates a hetero system - but that would be a significant launch performance "hit" for any large hetero system (if there are any).

@snyjm-18
Author

I just thought it was odd that the ARM cores did not automatically find libz while the Intel cores did. Libz was installed on all nodes in the same place, although the packages had slightly different names. This just made it very difficult to debug.

@jsquyres
Member

FWIW, it's likely not the presence of libz that is the issue -- it's the presence of the "devel" package for libz. I.e., when compiling Open MPI from source, you need the libz header files in order to build zlib support. It's possible that the libz "devel" package was installed on your x86 boxen, but not on the ARM boxen (the exact name of the package varies from distro to distro).
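A quick per-node check along these lines (package names vary by distro, as noted above):

```shell
# Check whether the zlib development headers are installed on this node.
# Open MPI's configure only enables its zlib-based compression when it
# finds zlib.h at build time, so a node missing the "devel" package
# silently produces a build without compression support.
if [ -e /usr/include/zlib.h ]; then
    echo "zlib.h: present"
else
    echo "zlib.h: missing -- install zlib-devel (RHEL/CentOS) or zlib1g-dev (Debian/Ubuntu)"
fi
```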

@snyjm-18
Author

Okay, that makes sense. Thank you for the explanation.

@Lip651

Lip651 commented Apr 28, 2021

> I was able to fix the issue. The x86 installation had found zlib and the arm cores had not, so the x86 cores were compressing the data and the arm cores were not decompressing. I did not specify to use libz or to not use libz, but one of them used it and the other did not. They both have libz installed in the same place. Is this a bug in the configure file, or is just something that people need to be on the lookout for?

I am having the exact same problem. I also have an ARM and an x86 processor. I have checked, and both have the zlib library installed, although not in the same place: one is under /lib/x86_64-linux-gnu/libz.so.1.2.11 and the ARM one is under lib/aarch64-linux-gnu/libz.so.1.2.1.1 . Could you please explain how you solved this issue? I mean, how did you get zlib compiled in on both devices? Did you change something in the ./configure invocation?

@ggouaillardet
Contributor

You only checked that the libz runtime library is present. In order to build zlib support, Open MPI needs the zlib.h include file (typically in /usr/include/zlib.h). This is likely provided by the zlib-dev package, and I suspect it is installed on the x86 nodes, but not on the ARM ones.

@Lip651

Lip651 commented Apr 28, 2021

> You only checked libz runtime is present for runtime. In order to build zlib support, Open MPI needs the zlib.h include file (typically in /usr/include/zlib.h). This is likely provided by the zlib-dev package, and I suspect it is installed on the x86 nodes, but not on the ARM ones.

Thank you very much for your hint. Indeed, zlib.h was missing on the ARM device. Nevertheless, I still get the same issue after installing it. I guess the paths are still different and this is still causing an error.

@ggouaillardet
Contributor

After installing this package, you need to re-run configure && make && make install

@Lip651

Lip651 commented Apr 28, 2021

> After installing this package, you need to re-run configure && make && make install

Thank you very much for your help. Really. I got it working in the end. I reinstalled the zlib library, compiled MPI again, and it's working now. The error definitely comes from this zlib.h, which for some reason is not in /usr/include on Ubuntu Server. I personally had this problem on a Raspberry Pi 3 running Ubuntu Server 18.04.

@snyjm-18
Author

I am glad you figured it out. I realize now I never posted what I did to fix the issue. I just configured without zlib.

@jsquyres jsquyres closed this as completed Jan 2, 2022
@PinkGranite

This really helped! Thanks a lot!
