MPI-Operator run example failed #464

Open
andyzheung opened this issue Apr 5, 2022 · 2 comments

andyzheung commented Apr 5, 2022

I set up mpi-operator v0.3.0 and tried to deploy the example:
mpi-operator-0.3.0/examples/horovod/tensorflow-mnist.yaml

but it does not seem to run correctly:

NAME                              READY   STATUS    RESTARTS   AGE
tensorflow-mnist-launcher-ffkrh   0/1     Error     4          2m39s
tensorflow-mnist-worker-0         1/1     Running   0          2m39s
tensorflow-mnist-worker-1         1/1     Running   0          2m39s

# kubectl logs tensorflow-mnist-launcher-ffkrh
Failed to add the host to the list of known hosts (/root/.ssh/known_hosts).
Failed to add the host to the list of known hosts (/root/.ssh/known_hosts).
Permission denied, please try again.
Permission denied, please try again.
[email protected]: Permission denied (publickey,password).

andyzheung commented Apr 5, 2022

I also tried to run mpi-operator-0.3.0/examples/horovod/tensorflow-mnist-elastic.yaml
and it still does not work. Is this repo still maintained?

kubectl logs tensorflow-mnist-elastic-launcher-rjlbk
Traceback (most recent call last):
  File "/usr/local/bin/horovodrun", line 33, in <module>
    sys.exit(load_entry_point('horovod==0.20.0', 'console_scripts', 'horovodrun')())
  File "/usr/local/lib/python3.7/dist-packages/horovod/runner/launch.py", line 722, in run_commandline
    _run(args)
  File "/usr/local/lib/python3.7/dist-packages/horovod/runner/launch.py", line 710, in _run
    return _run_elastic(args)
  File "/usr/local/lib/python3.7/dist-packages/horovod/runner/launch.py", line 623, in _run_elastic
    gloo_run_elastic(settings, env, args.command)
  File "/usr/local/lib/python3.7/dist-packages/horovod/runner/gloo_run.py", line 322, in gloo_run_elastic
    launch_gloo_elastic(command, exec_command, settings, env, get_common_interfaces, rendezvous)
  File "/usr/local/lib/python3.7/dist-packages/horovod/runner/gloo_run.py", line 287, in launch_gloo_elastic
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.7/dist-packages/horovod/runner/elastic/driver.py", line 182, in _discover_hosts
    self._notify_workers_host_changes(self._host_manager.current_hosts)
  File "/usr/local/lib/python3.7/dist-packages/horovod/runner/elastic/driver.py", line 199, in _notify_workers_host_changes
    if current_hosts.count_available_slots() >= self._min_np:
  File "/usr/local/lib/python3.7/dist-packages/horovod/runner/elastic/discovery.py", line 71, in count_available_slots
    return sum([self.get_slots(host) for host in self._host_assignment_order])
TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'
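
For anyone reading the trace above: the final frame sums the result of `get_slots(host)` over the hosts in the assignment order, so the `TypeError` suggests that at least one of those calls returned `None`. The snippet below is only a minimal sketch of that failure mode, not Horovod's actual code. The dictionary, the host names, and both helper functions are hypothetical, and it assumes the slot lookup behaves like `dict.get` for a host with no slot entry. Why the lookup comes back empty in this deployment is not established by the log.

```python
# Hypothetical stand-in for the discovery state; this is NOT Horovod's code.
def count_available_slots(slots_by_host, assignment_order):
    # dict.get returns None for a host without a slot entry,
    # and sum() cannot add None to an int.
    return sum([slots_by_host.get(host) for host in assignment_order])


def safe_count_available_slots(slots_by_host, assignment_order):
    # Defensive variant: treat a missing entry as zero slots.
    return sum(slots_by_host.get(host) or 0 for host in assignment_order)


# Assumed example data: worker-1 is in the order list but has no slot count.
slots = {"worker-0": 2}
order = ["worker-0", "worker-1"]

error_message = None
try:
    count_available_slots(slots, order)
except TypeError as exc:
    error_message = str(exc)

safe_total = safe_count_available_slots(slots, order)
print(error_message)
print(safe_total)
```

The defensive variant only hides the symptom. If a host is listed without a slot count, the discovery output that feeds the launcher is the more likely place to look.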

alculquicondor (Collaborator) commented

This is a duplicate of #445

If you can send a fix, I'm happy to review. Otherwise, you will have to resort to the v1 controller.
