mpirun failure on 4 nodes #6107
Comments
Your problem is right here:
You need to set up the SSH key on one of your nodes.
Hi Ralph, I verified that all hosts can SSH to each other:
With one node, or a permutation of three nodes, mpirun is ok:
The issue is reproduced only when I use four nodes in the mpirun command.
The problem is that you cannot ssh from one of those nodes to another node. mpirun uses a tree-like launch pattern, so it is not enough for the node running mpirun to reach every other node. You also need to be able to ssh from (for example) horovod-3 to horovod-4, and likewise for the other combinations.
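To make the tree-like launch concrete, here is a minimal sketch that assumes a binomial tree rooted at the node running mpirun. The helper names `binomial_parent` and `launch_edges` are hypothetical, and the actual topology Open MPI uses depends on its version and configuration, so treat the exact pairs as illustrative only. Under this assumption, three hosts are all reached directly from the first one, while a fourth host is reached through an intermediate node.

```python
def binomial_parent(rank: int) -> int:
    """Parent of `rank` in a binomial tree rooted at 0 (clear the highest set bit)."""
    if rank <= 0:
        raise ValueError("rank 0 is the root and has no parent")
    return rank & ~(1 << (rank.bit_length() - 1))


def launch_edges(hosts):
    """Return the (from_host, to_host) ssh hops, assuming hosts[0] runs the launcher.

    This is an illustrative model, not Open MPI's actual routing code.
    """
    return [(hosts[binomial_parent(r)], hosts[r]) for r in range(1, len(hosts))]


three = launch_edges(["horovod-1", "horovod-2", "horovod-3"])
four = launch_edges(["horovod-1", "horovod-2", "horovod-3", "horovod-4"])

# Hops that do not start at the launching node need node-to-node ssh access.
indirect = [edge for edge in four if edge[0] != "horovod-1"]
```

In this model every hop for three hosts starts at horovod-1, but with four hosts one hop starts at a different node, which would fail if that node cannot ssh to its child without a prompt.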
@regel Haven't heard back from you in a few days, so I'm going to assume Ralph's answer was the correct one. Feel free to ping back here if you need more help.
@regel Did you solve the problem?
Yes, I solved the issue. You need to cross-check every node against every other node. For instance, say you have nodes rpi01, rpi02, rpi03, and rpi04. When you run pi@rpi01 $ ssh rpi01, it should log in, and pi@rpi01 $ ssh rpi02 should log in as well, and so on for each pair. If any of these fails or asks again with a ...(yes/no)? prompt, the connections are not set up properly. Good luck!
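The cross-check described above can be automated. The following is a rough sketch, not a definitive tool: `pairwise_checks` and `run_checks` are hypothetical helper names, and it assumes the machine running the script can itself ssh to every node. It uses the OpenSSH option `BatchMode=yes`, which disables password and host-key confirmation prompts, so a pair that would otherwise stop at a (yes/no)? prompt shows up as a failure instead of hanging.

```python
import itertools
import subprocess

# BatchMode=yes makes ssh fail instead of prompting; ConnectTimeout bounds the wait.
SSH_OPTS = ["-o", "BatchMode=yes", "-o", "ConnectTimeout=5"]


def pairwise_checks(hosts):
    """Build (src, dst, argv) for every ordered pair of hosts, including src == dst."""
    checks = []
    for src, dst in itertools.product(hosts, repeat=2):
        # Log in to src, then from src try a non-interactive login to dst.
        argv = ["ssh", *SSH_OPTS, src, "ssh", *SSH_OPTS, dst, "true"]
        checks.append((src, dst, argv))
    return checks


def run_checks(hosts):
    """Run every check and return the (src, dst) pairs that failed."""
    failures = []
    for src, dst, argv in pairwise_checks(hosts):
        result = subprocess.run(
            argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode != 0:
            failures.append((src, dst))
    return failures
```

Calling `run_checks(["rpi01", "rpi02", "rpi03", "rpi04"])` would attempt all 16 combinations and return the pairs that need their keys or known_hosts entries fixed.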
Background information
Running Horovod/Open MPI in a cluster with multiple nodes. All nodes are declared in /etc/hosts and can SSH to each other.
mpirun works with 3 nodes but fails with 4 nodes, regardless of which nodes are chosen. All 4 nodes have identical hardware, OS version, and installed packages, and are running in the same data center.
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
yum install
Please describe the system on which you are running
Details of the problem
mpirun works with 3 nodes but fails with 4 nodes, regardless of which nodes are chosen.