Elastic resize broken in v2 operator: Horovod requires the slots parameter #445
Comments
/help |
@alculquicondor: Please ensure the request meets the requirements listed here. If this request no longer meets these requirements, the label can be removed. |
Also, we need to adapt the intel entry point to the change of the hostfile format: https://github.com/kubeflow/mpi-operator/blob/master/examples/base/intel-entrypoint.sh |
@alculquicondor Thanks for mentioning this. Sure. IIUC, we need to change the hostfile line written in mpi-operator/pkg/controller/mpi_job_controller.go (line 1186 in 31d4575) to buffer.WriteString(fmt.Sprintf("%s%s-%d.%s slots=%d\n", mpiJob.Name, workerSuffix, i, workersService, slots)), right?
Also, we need to adapt the new hostfile format in
|
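For reference, the hostfile change proposed in this comment can be sketched as a standalone Go program. This is only an illustration: buildHostfile and its parameters are hypothetical names, not the operator's actual function, and the job name, suffix, and service name in main are made-up values.

```go
package main

import (
	"fmt"
	"strings"
)

// buildHostfile is a hypothetical helper (not the operator's real code).
// It writes one line per worker in the "<host> slots=<n>" form that
// Horovod expects, using the same format string quoted in the comment.
func buildHostfile(jobName, workerSuffix, workersService string, workers, slots int) string {
	var b strings.Builder
	for i := 0; i < workers; i++ {
		fmt.Fprintf(&b, "%s%s-%d.%s slots=%d\n", jobName, workerSuffix, i, workersService, slots)
	}
	return b.String()
}

func main() {
	// Example values are assumptions chosen for illustration only.
	fmt.Print(buildHostfile("pi", "-worker", "pi-worker", 2, 4))
}
```

With these example values, each worker gets its own line ending in slots=4, so the slot count lives in the hostfile again in addition to the environment variables.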
/assign |
In the v2 controller, the slots parameter in the hostfile was replaced by environment variables:
mpi-operator/v2/pkg/controller/mpi_job_controller.go
Lines 116 to 117 in b88edad
However, Horovod still requires the number of slots in the hostfile, so we need to put the parameter back. Duplicating the information in both places should be fine.
It's unclear whether Horovod supports Intel MPI.
We need an E2E test to ensure that the fix works long term.
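One check such a test could include, as a rough sketch: parse the generated hostfile and fail if any entry is missing the slots parameter. parseHostfile is a hypothetical helper written for this illustration, and it assumes the "<host> slots=<n>" line format discussed in this issue. It is not part of the operator or of Horovod.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseHostfile is a hypothetical helper for illustration. It maps each
// host to its slot count and returns an error if a line lacks a valid
// "slots=<n>" field.
func parseHostfile(content string) (map[string]int, error) {
	hosts := make(map[string]int)
	sc := bufio.NewScanner(strings.NewReader(content))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue // ignore blank lines
		}
		fields := strings.Fields(line)
		if len(fields) != 2 || !strings.HasPrefix(fields[1], "slots=") {
			return nil, fmt.Errorf("hostfile line %q has no slots parameter", line)
		}
		n, err := strconv.Atoi(strings.TrimPrefix(fields[1], "slots="))
		if err != nil || n < 1 {
			return nil, fmt.Errorf("hostfile line %q has an invalid slots value", line)
		}
		hosts[fields[0]] = n
	}
	return hosts, sc.Err()
}

func main() {
	hosts, err := parseHostfile("pi-worker-0.pi-worker slots=2\npi-worker-1.pi-worker slots=2\n")
	fmt.Println(len(hosts), err)

	// A hostfile without slots, as the v2 controller currently writes it, is rejected.
	_, err = parseHostfile("pi-worker-0.pi-worker\n")
	fmt.Println(err != nil)
}
```

A real E2E test would still need to run an elastic Horovod job against the operator. This only guards the hostfile format.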