Elastic resize broken in v2 operator: Horovod requires the slots parameter #445

Closed
Tracked by #507
alculquicondor opened this issue Dec 7, 2021 · 6 comments · Fixed by #523
@alculquicondor
Collaborator

In the v2 controller, the slots parameter in the hostfile was replaced by environment variables:

openMPISlotsEnv = "OMPI_MCA_orte_set_default_slots"
intelMPISlotsEnv = "I_MPI_PERHOST"

However, Horovod still requires the number of slots in the hostfile, so we need to put the parameter back. Duplicating the information in both the hostfile and the environment variables should be fine.
It's unclear whether Horovod supports Intel MPI.

We need an E2E test to ensure that the fix works long term.
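
To illustrate the environment-variable side of the duplication described above, here is a minimal sketch. The constant names come from the snippet in this comment; the `slotsEnv` helper, its `impl` labels, and the assumption that the value is simply the slot count are hypothetical and not taken from the controller code.

```go
package main

import (
	"fmt"
	"strconv"
)

const (
	openMPISlotsEnv  = "OMPI_MCA_orte_set_default_slots"
	intelMPISlotsEnv = "I_MPI_PERHOST"
)

// slotsEnv is a hypothetical helper: it returns the "NAME=value" pair that
// would carry the slot count for the given MPI implementation.
// The labels "OpenMPI" and "Intel" are assumptions made for this sketch.
func slotsEnv(impl string, slots int) (string, error) {
	switch impl {
	case "OpenMPI":
		return openMPISlotsEnv + "=" + strconv.Itoa(slots), nil
	case "Intel":
		return intelMPISlotsEnv + "=" + strconv.Itoa(slots), nil
	default:
		return "", fmt.Errorf("unknown MPI implementation %q", impl)
	}
}

func main() {
	for _, impl := range []string{"OpenMPI", "Intel"} {
		env, err := slotsEnv(impl, 2)
		if err != nil {
			panic(err)
		}
		fmt.Println(env)
	}
}
```

With the fix proposed in this issue, the same slot count would additionally appear as `slots=N` on each hostfile line, so tools that read the hostfile directly still see it.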

@alculquicondor
Collaborator Author

/help

@google-oss-prow

@alculquicondor:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@google-oss-prow google-oss-prow bot added the help wanted Extra attention is needed label Dec 7, 2021
@alculquicondor
Collaborator Author

Also, we need to adapt the Intel entrypoint to the new hostfile format: https://github.com/kubeflow/mpi-operator/blob/master/examples/base/intel-entrypoint.sh

@alculquicondor
Collaborator Author

@tenzen-y we should fix this as part of the next release #507, but I don't have bandwidth. Do you?

@tenzen-y
Member

tenzen-y commented Feb 6, 2023

> @tenzen-y we should fix this as part of the next release #507, but I don't have bandwidth. Do you?

@alculquicondor Thanks for mentioning this. Sure.

IIUC, we need to change the line that writes each hostfile entry from

buffer.WriteString(fmt.Sprintf("%s%s-%d.%s.%s.svc\n", mpiJob.Name, workerSuffix, i, workersService, mpiJob.Namespace))

to

buffer.WriteString(fmt.Sprintf("%s%s-%d.%s slots=%d\n", mpiJob.Name, workerSuffix, i, workersService, slots))

right?

Also, we need to adapt the following loop to the new hostfile format:

cat /etc/mpi/hostfile | while read host

@tenzen-y tenzen-y mentioned this issue Feb 6, 2023
10 tasks
@tenzen-y
Member

tenzen-y commented Feb 9, 2023

/assign
