Training Operator WG and Kubeflow 1.5 release #2105

Closed
DomFleischmann opened this issue Jan 19, 2022 · 14 comments

Comments

@DomFleischmann
Contributor

@kubeflow/wg-training-leads let's use this tracking issue to coordinate the integration of Training Operator with the Kubeflow 1.5 release.

First off, a heads-up: the feature freeze phase starts on Wednesday, January 26th. Before then I'd like to update this repo with the manifests from the kubeflow/training-operator repo, so that we can cut the first RC tag in this repo.

So what I'd like to ask as a first step before the feature freeze is:

1. What version of Training Operator would you like to include for the 1.5 release?
2. Could you provide me with a branch/tag for this version? It doesn't have to be final. The branch/tag can keep getting fixes throughout the release process, but no new features.
3. Are there any open issues or work in progress that you will be working on for your version while the KF release process is under way?
4. Which K8s versions will kubeflow/training-operator support?

This was referenced Jan 24, 2022
@johnugeorge
Member

johnugeorge commented Jan 26, 2022

The RC release tag v1.4.0-rc.0 has been created: https://github.com/kubeflow/training-operator/tree/v1.4.0-rc.0

@kimwnasptd In the last release, I see that the manifests structure in this repo differs from the one in the training operator repo. Just wondering how this happened?

Can you sync the RC-tagged manifests from the training operator repo with this repo? https://github.com/kubeflow/training-operator/tree/v1.4.0-rc.0/manifests

The MPI operator is now bundled in the training operator, so this folder (https://github.com/kubeflow/manifests/tree/master/apps/mpi-job/upstream) can be deleted.

/cc @terrytangyuan

@kimwnasptd
Member

Thanks for the update @johnugeorge @terrytangyuan!

> @kimwnasptd In the last release, I see that the manifests structure in this repo differs from the one in the training operator repo. Just wondering how this happened?

Hmmm, not sure. It seems that the last PR that updated the Operator's manifests was the one that copied over the RC2 manifests, #2032. Looking at the stable 1.3.0 version of the Operator, I do see the crds folder: https://github.com/kubeflow/training-operator/tree/v1.3.0/manifests/base. So the issue was that we never updated the manifests from RC2 to the final release. I'll keep this in mind for this release.

I'll create a PR to update the manifests from Training Operator now, I have an automated script for this.
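For readers following along, a sync like the one described above could look roughly like the sketch below. This is not the actual script mentioned in the comment; the function name, the `manifests/` source folder layout, and the `apps/training-operator/upstream` target path are assumptions made for illustration. It works on a local checkout of the operator at the desired tag and replaces the target folder wholesale, so files removed upstream (such as a folder that only existed in an RC) do not linger.

```python
import shutil
import tempfile
from pathlib import Path


def sync_manifests(operator_checkout: Path, target_dir: Path) -> list[str]:
    """Replace target_dir with the manifests/ folder of a local operator checkout.

    Hypothetical helper: returns the sorted relative paths of the copied files.
    """
    source = operator_checkout / "manifests"
    if not source.is_dir():
        raise FileNotFoundError(f"no manifests folder in {operator_checkout}")
    # Remove the old copy first so files deleted upstream do not linger.
    if target_dir.exists():
        shutil.rmtree(target_dir)
    shutil.copytree(source, target_dir)
    return sorted(
        p.relative_to(target_dir).as_posix()
        for p in target_dir.rglob("*")
        if p.is_file()
    )


def demo() -> list[str]:
    """Run the sync against throwaway directories (no network, no real repos)."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Stand-in for a checkout of kubeflow/training-operator at some tag.
        checkout = root / "training-operator"
        crds = checkout / "manifests" / "base" / "crds"
        crds.mkdir(parents=True)
        (crds / "kustomization.yaml").write_text("resources: []\n")
        # Stand-in for the target folder, holding a stale file from an older RC.
        target = root / "manifests" / "apps" / "training-operator" / "upstream"
        target.mkdir(parents=True)
        (target / "stale.yaml").write_text("old\n")
        return sync_manifests(checkout, target)


if __name__ == "__main__":
    print(demo())
```

Deleting the target before copying is the design choice that matters here: a plain overwrite would have left RC-only files in place, which is the kind of drift between the two repos discussed above.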

@terrytangyuan
Member

Great. Thanks!

@jbottum

jbottum commented Jan 26, 2022

@terrytangyuan @johnugeorge does the training operator v1.4.0-rc.0 include elastic training? kubeflow/community#522

@terrytangyuan
Member

> @terrytangyuan @johnugeorge does the training operator v1.4.0-rc.0 include elastic training? kubeflow/community#522

Yes

@johnugeorge
Member

> @terrytangyuan @johnugeorge does the training operator v1.4.0-rc.0 include elastic training? kubeflow/community#522

In the new release, elastic PyTorch training is supported through PyTorchJob, and elastic Horovod training is supported through MPIJob. Examples:

https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/elastic/echo/echo.yaml

https://github.com/kubeflow/mpi-operator/blob/master/examples/horovod/tensorflow-mnist-elastic.yaml

@DomFleischmann
Contributor Author

Hi @kubeflow/wg-training-leads, before the manifest testing on Wednesday, Feb 9th, the release team is planning to cut another RC to use for the testing.

Based on previous communication, the release team will be using Training Operator version v1.4.0-rc.0. If the Training WG has identified any issues since the feature freeze and would like to update the Training Operator version before the manifest testing, let us know before Feb 9th. Thank you!

@johnugeorge

@kimwnasptd
Member

After syncing in today's AutoML/Training meeting, we will keep using the v1.4.0-rc.0 tag for RC1 of the manifests. A newer RC might be cut for the kubeflow/training-operator repo later on, in case more issues arise.

One more note: the @kubeflow/wg-automl-leads will update the kubeflow/katib e2e tests to use the v1.5-branch branch of the manifests. This means the e2e tests will run against the latest training operators, so we'll be keeping an eye on any issues that arise.

@kimwnasptd
Member

@kubeflow/wg-training-leads I'm working on finalizing the manifests for the release, as we are getting closer to the release date of March 9th.

Regarding the kubeflow/training-operator repo, when are you planning to cut the final v1.4.0 tag? Could you do it this week, so that we can get the manifests closer to their final state?

@johnugeorge
Member

@kimwnasptd Yes, we will cut it this week.

@kimwnasptd
Member

Just saw it's ready. Congrats on the release 🎉

@shannonbradshaw

Hey, folks. Are there docs changes required as a result of this work? If so, please create an issue and mention it on this docs tracking issue: kubeflow/website#3130

@juliusvonkohout
Member

/close

There has been no activity for a long time. Please reopen if necessary.

@google-oss-prow

@juliusvonkohout: Closing this issue.



7 participants