mpirun launch failure on heterogeneous system #6762
Comments
I was able to fix the issue. The x86 installation had found zlib and the ARM installation had not, so the x86 side was compressing the launch data and the ARM side was not decompressing it. I did not specify whether or not to use libz, but one build used it and the other did not, even though both systems have libz installed in the same place. Is this a bug in the configure script, or is it just something that people need to be on the lookout for?
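If you run into something similar, one way to confirm which side picked up zlib is to look at the configure results in each build tree. This is a minimal sketch; the build-tree paths are placeholders, not paths from this report:

# In the x86 build tree
grep -i zlib /path/to/x86-build/config.log
# In the ARM build tree
grep -i zlib /path/to/arm-build/config.log
# If the zlib checks succeed on one side and fail on the other, the two
# installs will disagree about compressing the launch message.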
Mmm... yes, I can see how Open MPI would assume that the configuration is the same everywhere, and would therefore not detect that situation (and end up with a correct-but-confusing error message). @rhc54 Is there an easy way around this?
Only solution I can think of would be to add a flag to the "phone home" msg from each daemon indicating if compression is supported or not, and then turn it off if any daemon doesn't support it. Would also have to include a flag in the launch msg (which is the only one that gets compressed) indicating if compression was or wasn't used. Perhaps easier would be to just automatically turn off compression if someone indicates a hetero system - but that would be a significant launch performance "hit" for any large hetero system (if there are any).
I just thought it was odd that the ARM nodes did not automatically find libz while the Intel nodes did. Libz was installed on all nodes in the same place, although the installed libraries did have slightly different names. This just made it very difficult to debug.
FWIW, it's likely not the presence of the libz library itself that matters, but whether configure found the zlib development header (zlib.h) on each system.
Okay, that makes sense. Thank you for the explanation.
I am having the exact same problem. I also have an ARM and an x86 processor. I have checked, and both have the libz library installed in the same place.
You only checked for the library itself. Check whether the zlib.h development header is also present on both systems; configure needs the header, not just the library.
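A quick check along these lines is sketched below; the header locations and the Debian/Ubuntu package name are common defaults, not details taken from this thread:

# Look for the header in the usual include directories
ls /usr/include/zlib.h /usr/local/include/zlib.h 2>/dev/null
# On Debian/Ubuntu, list installed zlib packages; the header ships in the -dev package
dpkg -l | grep -i zlib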
Thank you very much for your hint. Indeed, the zlib.h was missing on the ARM device. Nevertheless, I still get the same issue after installing it. I guess the paths are still different and this is still causing an error.
After installing this package, you need to re-run configure and rebuild/reinstall Open MPI so that the new header is actually picked up.
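For reference, the overall sequence looks like the following; the zlib1g-dev package name is the usual one on Debian/Ubuntu, the prefix and parallelism are placeholders, and you should reuse whatever configure flags you used originally:

# Install the zlib development header (Debian/Ubuntu package name)
sudo apt-get install zlib1g-dev
# Rebuild Open MPI from its source tree so configure re-detects zlib
./configure --prefix=$HOME/opt/4.0.1 --enable-heterogeneous   # plus your other flags
make -j 4
make install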
Thank you very much for your help. Really. I got it working in the end. I reinstalled the zlib library and then compiled Open MPI again, and it's working now. The error definitely comes from this zlib.h, which for some reason is not present in a default Ubuntu Server installation.
I am glad you figured it out. I realize now I never posted what I did to fix the issue. I just configured without zlib.
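If you go that route, first confirm what zlib-related option your configure script actually accepts, since this thread does not state the exact flag name; the --without-zlib shown below is an assumption to verify against the --help output:

# See which zlib-related options this configure script accepts
./configure --help | grep -i zlib
# If a --with-zlib/--without-zlib option is listed (assumption: verify above),
# pass the same choice on every node so all installs agree
./configure --prefix=$HOME/opt/4.0.1 --enable-heterogeneous --without-zlib
make -j 4 && make install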
This really helped! Thanks a lot!
Background information
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
v4.0.1
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
tarball
On x86:
./configure --prefix=/global/home/users/johns/opt/4.0.1 --with-slurm=no --with-ucx=/global/home/users/johns/opt/ucx --with-verbs=no --enable-heterogeneous --enable-debug --with-hwloc=internal
and on ARM:
./configure --prefix=/home/johns/opt/4.0.1 --with-slurm=no --with-ucx=/home/johns/opt/ucx --with-verbs=no --enable-heterogeneous --enable-debug --with-hwloc=internal
Please describe the system on which you are running
Details of the problem
Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.
I would like to launch a job with a single mpirun across a heterogeneous system that has both ARM and x86 nodes, where Open MPI is installed in different locations on each architecture.
I do not know if this is relevant, but I can launch jobs on the ARM nodes from an Intel node.
Issue #4437 was similar and was resolved by switching to a homogeneous system. I have to use the heterogeneous system.
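For context, the kind of launch being attempted looks roughly like the sketch below; the hostnames, slot counts, and program name are placeholders, not details from this report:

# Hostfile mixing the two architectures (names are examples)
cat > hosts.txt << 'EOF'
x86-node01 slots=4
arm-node01 slots=4
EOF
# Single mpirun spanning both architectures
mpirun --hostfile hosts.txt -np 8 ./hello_mpi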