Does the kubectl-delivery image have an arm/aarch version? #1857

Closed
aavbsouza opened this issue Jul 10, 2023 · 9 comments

Comments

@aavbsouza

Hello everyone. I am trying to run the training-operator on a small test cluster of rpi4 boards. The training operator has been installed and appears to be working. However, when I tried to run a small test, I got an error with the launcher container.

The kubectl-delivery image on Docker Hub appears to have been last updated two years ago and only lists amd64 architectures:
https://hub.docker.com/r/mpioperator/kubectl-delivery/tags
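
For readers who want to check architecture support without browsing the tag page, here is a minimal sketch. It assumes you have saved the JSON printed by `docker manifest inspect <image>` and that the image publishes a manifest list (OCI image index) with a `manifests` array. The function name and the sample data are hypothetical and are not taken from this issue.

```python
import json


def supported_architectures(manifest_json: str) -> list:
    """Return the sorted CPU architectures listed in a manifest list / image index.

    A single-platform image manifest has no "manifests" array, so this
    returns an empty list for it; the architecture then has to be read
    from the image config instead.
    """
    doc = json.loads(manifest_json)
    archs = set()
    for entry in doc.get("manifests", []):
        arch = entry.get("platform", {}).get("architecture")
        # Attestation entries are commonly marked with architecture "unknown".
        if arch and arch != "unknown":
            archs.add(arch)
    return sorted(archs)


# Hypothetical, hand-written sample; not the real kubectl-delivery manifest.
SAMPLE = json.dumps({
    "schemaVersion": 2,
    "manifests": [
        {"digest": "sha256:aaa", "platform": {"architecture": "amd64", "os": "linux"}},
    ],
})

archs = supported_architectures(SAMPLE)
print(archs)
print("arm64 supported:", "arm64" in archs)
```

If `arm64` is missing from the result, an arm64 node such as an rpi4 cannot run the image natively.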

The log of the launcher container is:

Defaulted container "mpi" out of: mpi, kubectl-delivery (init)
Error from server (BadRequest): container "mpi" in pod "simple-hello-world-launcher" is waiting to start: PodInitializing
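
The message above only says that the pod is still initializing. To see which init container is holding it up, one option is to inspect the pod status. The sketch below assumes the JSON comes from `kubectl get pod <name> -o json` and reads the standard `status.initContainerStatuses` field. The sample pod, including the `CrashLoopBackOff` reason, is invented for illustration and is not the actual status from this cluster.

```python
import json


def blocked_init_containers(pod_json: str) -> list:
    """Return (name, reason, message) for init containers that are waiting
    or that terminated with a non-zero exit code."""
    pod = json.loads(pod_json)
    blocked = []
    for status in pod.get("status", {}).get("initContainerStatuses", []):
        state = status.get("state", {})
        waiting = state.get("waiting")
        terminated = state.get("terminated")
        if waiting is not None:
            blocked.append((status.get("name", ""),
                            waiting.get("reason", ""),
                            waiting.get("message", "")))
        elif terminated is not None and terminated.get("exitCode", 0) != 0:
            blocked.append((status.get("name", ""),
                            terminated.get("reason", ""),
                            terminated.get("message", "")))
    return blocked


# Hypothetical sample status, written by hand for this example.
SAMPLE_POD = json.dumps({
    "metadata": {"name": "simple-hello-world-launcher"},
    "status": {
        "initContainerStatuses": [
            {"name": "kubectl-delivery",
             "state": {"waiting": {"reason": "CrashLoopBackOff"}}},
        ],
    },
})

for name, reason, message in blocked_init_containers(SAMPLE_POD):
    print(f"{name}: {reason} {message}".strip())
```

`kubectl logs simple-hello-world-launcher -c kubectl-delivery` would then show the init container's own output.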

Is this expected?

thanks

@tenzen-y
Member

Yes, that image isn't built automatically. Building it automatically might be a good idea, though, so feel free to open a PR against the publishing workflow:

https://github.com/kubeflow/training-operator/blob/9e084ff0b0904b82312225c4baca295baf482b1e/.github/workflows/publish-core-images.yaml

That said, I would suggest using MPIJob v2 (https://github.com/kubeflow/mpi-operator) instead of MPIJob v1.

@tenzen-y
Member

/kind question

@aavbsouza
Author

Hello @tenzen-y, it appears that the Dockerfile for this image does not exist in this repository and that it was removed from the mpi-operator in this pull request (https://github.com/kubeflow/mpi-operator/pull/494/files). What replaces this image when using MPIJob v2? Would it be passed as an argument in the CRD definition (#1525)?

Another question: is it mandatory to use a scheduling plugin like the one provided by the Volcano project?

thanks

@tenzen-y
Member

It appears that the Dockerfile for this image does not exist in this repository and that it was removed from the mpi-operator in this pull request (https://github.com/kubeflow/mpi-operator/pull/494/files).

Oh, yes. It seems that we need to copy the Dockerfile to this repository (kubeflow/training-operator).

What replaces this image when using MPIJob v2? Would it be passed as an argument in the CRD definition (#1525)?

There are two MPIJob versions, and they are hosted in separate operators (repositories):

  • MPIJob v1 is deployed as part of training-operator.
  • MPIJob v2 is deployed by mpi-operator.

MPIJob v1 uses kubectl exec, provided through kubectl-delivery, to initialize the MPI environment, whereas MPIJob v2 uses SSH. So MPIJob v2 doesn't need kubectl-delivery and is more scalable than MPIJob v1.
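
To make the v2 side concrete, here is a rough sketch of what an MPIJob v2 manifest can look like, built as a Python dict and printed as JSON (which `kubectl apply -f` also accepts). The field names follow the `kubeflow.org/v2beta1` API as I understand it and should be checked against the mpi-operator repository. The job name, image, and launcher command are placeholders, and the worker image is assumed to start an SSH server on its own.

```python
import json


def mpijob_v2_manifest(name: str, image: str, workers: int, slots_per_worker: int = 1) -> dict:
    """Build a minimal MPIJob v2 (v2beta1) manifest with one launcher and N workers."""
    total_slots = workers * slots_per_worker
    return {
        "apiVersion": "kubeflow.org/v2beta1",
        "kind": "MPIJob",
        "metadata": {"name": name},
        "spec": {
            "slotsPerWorker": slots_per_worker,
            # Where the operator-generated SSH keys are mounted (assumed path).
            "sshAuthMountPath": "/root/.ssh",
            "mpiReplicaSpecs": {
                "Launcher": {
                    "replicas": 1,
                    "template": {"spec": {"containers": [{
                        "name": "launcher",
                        "image": image,
                        # Placeholder command: print each worker's hostname.
                        "command": ["mpirun", "-n", str(total_slots), "hostname"],
                    }]}},
                },
                "Worker": {
                    "replicas": workers,
                    # Assumption: the image's default command runs sshd.
                    "template": {"spec": {"containers": [{
                        "name": "worker",
                        "image": image,
                    }]}},
                },
            },
        },
    }


# "example.com/my-mpi-image:latest" is a hypothetical image reference.
manifest = mpijob_v2_manifest("simple-hello-world", "example.com/my-mpi-image:latest", workers=2)
print(json.dumps(manifest, indent=2))
```

Note that there is no kubectl-delivery init container anywhere in this manifest; the launcher reaches the workers over SSH instead.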

@tenzen-y
Member

tenzen-y commented Jul 11, 2023

Another question: is it mandatory to use a scheduling plugin like the one provided by the Volcano project?

The training-operator supports Volcano gang scheduling. The following docs explain how to use the Volcano scheduler:

https://www.kubeflow.org/docs/components/training/job-scheduling

However, we have so far verified only Volcano gang scheduling, so I'm not sure whether the training operator works well with the other Volcano scheduler plugins.
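
For anyone unfamiliar with the term, gang scheduling means that the pods of a job are placed all together or not at all, so a launcher never ends up waiting on workers that cannot be scheduled. The toy model below only illustrates that all-or-nothing rule. It is not Volcano's algorithm, and the function and its inputs are made up for this example.

```python
from typing import List, Optional


def gang_schedule(free_slots_per_node: List[int], pods_needed: int) -> Optional[List[int]]:
    """Return the node index chosen for each pod if every pod fits, otherwise None.

    Nothing is placed unless the whole gang fits, which is the point of
    gang scheduling.
    """
    remaining = list(free_slots_per_node)  # work on a copy; commit only on success
    placement = []
    for _ in range(pods_needed):
        for node, free in enumerate(remaining):
            if free > 0:
                remaining[node] -= 1
                placement.append(node)
                break
        else:
            # No node has room for this pod, so reject the whole job.
            return None
    return placement


print(gang_schedule([2, 1], 3))  # all three pods fit
print(gang_schedule([1, 1], 3))  # one pod would not fit, so nothing is placed
```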

@aavbsouza
Author

Hello @tenzen-y. I am looking into the mpi-operator repository. Are there any guidelines on how to support SSH in the images used by this operator (I am thinking about custom images)? I see only one set of images that adds SSH (https://github.com/kubeflow/mpi-operator/tree/master/build/base). Is there any documentation about the contract expected by the mpi-operator? Thanks again =)

@tenzen-y
Member

tenzen-y commented Jul 11, 2023

Can you create a separate issue on the mpi-operator repository?
I think that isn't related to training-operator. Thanks for your understanding.

@tenzen-y
Member

/close

If you have any other questions about the training-operator, feel free to open new issues.

@google-oss-prow

@tenzen-y: Closing this issue.
