Training Operator WG and Kubeflow 1.5 release #2105

DomFleischmann · 2022-01-19T10:36:19Z

@kubeflow/wg-training-leads let's use this tracking issue to coordinate the integration of Training Operator with the Kubeflow 1.5 release.

First off, a heads up that the feature freeze phase will start on Wednesday (26th January). Before then, I'd like to update this repo with the manifests from the kubeflow/training-operator repo so that we can cut the first RC tag in this repo.

So what I'd like to ask as a first step before the feature freeze is:

1. What version of Training Operator would you like to include for the 1.5 release?
2. Could you provide me with a branch/tag for this version? It doesn't have to be final. The branch/tag can keep receiving fixes throughout the release process, but not new features.
3. Are there any open issues or work in progress that you will be working on for your version while the KF release process is progressing?
4. What will the supported K8s versions be for kubeflow/training-operator?

johnugeorge · 2022-01-26T20:05:28Z

The RC release tag v1.4.0-rc.0 has been created: https://github.com/kubeflow/training-operator/tree/v1.4.0-rc.0

@kimwnasptd In the last release, I see that the manifests structure in this repo differs from the one in the training operator repo. Just wondering, how did this happen?

Can you sync the RC-tagged manifests from the training operator repo with this repo? https://github.com/kubeflow/training-operator/tree/v1.4.0-rc.0/manifests

The MPI operator is now bundled in the training operator, so this folder (https://github.com/kubeflow/manifests/tree/master/apps/mpi-job/upstream) can be deleted.

/cc @terrytangyuan

kimwnasptd · 2022-01-26T21:00:51Z

Thanks for the update @johnugeorge @terrytangyuan!

> @kimwnasptd In the last release, I see that the manifests structure in this repo differs from the one in the training operator repo. Just wondering, how did this happen?

Hmmm, not sure. It seems that the last PR that updated the Operator's manifests was for copying over the RC2 manifests #2032. Looking at the stable 1.3.0 version of the Operator I see the crds folder https://github.com/kubeflow/training-operator/tree/v1.3.0/manifests/base. So the issue was that we never updated the manifests from RC2 to the final release. Will keep in mind for this release.

I'll create a PR to update the manifests from Training Operator now; I have an automated script for this.
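The automated script mentioned above is not shown in this thread. As a rough illustration only, a manifest sync of this kind could be sketched as follows: shallow-clone the upstream repo at the RC tag, then replace the local copy of the manifests wholesale so that no stale files survive. The function names `clone_command` and `sync_manifests`, and any target directory passed to them, are hypothetical and not taken from the actual script.

```python
import shutil
from pathlib import Path


def clone_command(tag: str, dest: str) -> list[str]:
    """Build a shallow `git clone` command for the upstream repo at a tag."""
    return [
        "git", "clone", "--depth", "1", "--branch", tag,
        "https://github.com/kubeflow/training-operator.git", dest,
    ]


def sync_manifests(upstream_checkout: Path, target_dir: Path) -> list[str]:
    """Replace target_dir with a fresh copy of <checkout>/manifests.

    The old copy is removed first so that files deleted upstream do not
    linger locally. Returns the sorted relative paths of the copied files.
    """
    source = upstream_checkout / "manifests"
    if not source.is_dir():
        raise FileNotFoundError(f"no manifests directory in {upstream_checkout}")
    if target_dir.exists():
        shutil.rmtree(target_dir)
    shutil.copytree(source, target_dir)
    return sorted(
        p.relative_to(target_dir).as_posix()
        for p in target_dir.rglob("*")
        if p.is_file()
    )
```

A run might pass `clone_command("v1.4.0-rc.0", "/tmp/training-operator")` to `subprocess.run(..., check=True)` and then call `sync_manifests` with that checkout and the local upstream folder (the local path is an assumption here). Re-running the same steps against the final tag would avoid the situation described above, where the manifests stayed at an RC.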

terrytangyuan · 2022-01-26T21:12:33Z

Great. Thanks!

jbottum · 2022-01-26T21:36:04Z

@terrytangyuan @johnugeorge does the training operator v1.4.0-rc.0 include elastic training? kubeflow/community#522

terrytangyuan · 2022-01-26T21:55:38Z

> @terrytangyuan @johnugeorge does the training operator v1.4.0-rc.0 include elastic training? kubeflow/community#522

Yes

johnugeorge · 2022-01-27T07:29:32Z

> @terrytangyuan @johnugeorge does the training operator v1.4.0-rc.0 include elastic training? kubeflow/community#522

Elastic PyTorch training is supported through PyTorchJob in the new release, and elastic Horovod training is supported through MPIJob.

https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/elastic/echo/echo.yaml

https://github.com/kubeflow/mpi-operator/blob/master/examples/horovod/tensorflow-mnist-elastic.yaml
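For readers who want a feel for what an elastic PyTorchJob looks like, here is a minimal sketch that builds such a manifest as a Python dict. The `elasticPolicy` field names (`rdzvBackend`, `minReplicas`, `maxReplicas`, `maxRestarts`) and the `pytorch` container name reflect my understanding of the linked echo example and should be checked against the CRD for the release in use. The helper name `elastic_pytorchjob`, the image, and the default restart budget are illustrative assumptions.

```python
import json


def elastic_pytorchjob(name, image, min_replicas, max_replicas, replicas):
    """Build a PyTorchJob manifest with an elastic policy (sketch only)."""
    # Local sanity check; whether the operator enforces this is not assumed.
    if not (min_replicas <= replicas <= max_replicas):
        raise ValueError("replicas must lie between min_replicas and max_replicas")
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {
            # Assumed field names; verify against the PyTorchJob CRD.
            "elasticPolicy": {
                "rdzvBackend": "c10d",
                "minReplicas": min_replicas,
                "maxReplicas": max_replicas,
                "maxRestarts": 100,  # illustrative value
            },
            "pytorchReplicaSpecs": {
                "Worker": {
                    "replicas": replicas,
                    "restartPolicy": "OnFailure",
                    "template": {
                        "spec": {
                            "containers": [
                                # Container name assumed to be "pytorch".
                                {"name": "pytorch", "image": image}
                            ]
                        }
                    },
                }
            },
        },
    }


if __name__ == "__main__":
    job = elastic_pytorchjob(
        "elastic-demo", "example.com/my-elastic-trainer:latest", 1, 3, 2
    )
    # kubectl accepts JSON as well as YAML, so this output can be applied as-is.
    print(json.dumps(job, indent=2))
```

The elastic Horovod case goes through MPIJob instead and is not covered by this sketch; the second link above is the reference for it.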

DomFleischmann · 2022-02-08T09:12:15Z

Hi @kubeflow/wg-training-leads, before the manifest testing on Wednesday, Feb 9th, the release team is planning to cut another RC to use for the testing.

Based on a previous communication, the release team will be using Training Operator version v1.4.0-rc.0. If the Training WG has identified any issues since the feature freeze and would like to update the Training Operator version before the manifest testing, let us know before Feb. 9th. Thank you!

@johnugeorge

kimwnasptd · 2022-02-09T15:46:32Z

After syncing in today's AutoML/Training meeting, we will keep using the v1.4.0-rc.0 tag for the RC1 of the manifests. A newer RC might be cut for the kubeflow/training-operator repo later on, in case more issues arise.

One more note: the @kubeflow/wg-automl-leads will update the kubeflow/katib e2e tests to use the v1.5-branch branch of the manifests. This means that the e2e tests will be using the latest training operators, so we'll be keeping an eye on issues that might arise.

kimwnasptd · 2022-03-01T07:17:56Z

@kubeflow/wg-training-leads I'm working on finalizing the manifests for the release, as we are getting closer to the release date of March 9th.

Regarding the kubeflow/training-operator repo, when are you planning to cut the final v1.4.0 tag? Could you do it within this week so that we can get the manifests closer to their final state?

johnugeorge · 2022-03-01T11:42:08Z

@kimwnasptd Yes, we will cut it this week.

kimwnasptd · 2022-03-04T17:21:23Z

Just saw it's ready. Congrats on the release 🎉

shannonbradshaw · 2022-03-07T23:03:15Z

Hey, folks. Are there docs changes required as a result of this work? If so, please create an issue and mention it on this docs tracking issue: kubeflow/website#3130

juliusvonkohout · 2023-08-24T16:23:21Z

/close

There has been no activity for a long time. Please reopen if necessary.

google-oss-prow · 2023-08-24T16:23:25Z

@juliusvonkohout: Closing this issue.

In response to this:

> /close
>
> There has been no activity for a long time. Please reopen if necessary.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

This was referenced Jan 24, 2022:
Notebooks WG and Kubeflow 1.5 release #2109 (Closed)
KF 1.5 tracking #2112 (Closed)

kimwnasptd mentioned this issue Jan 26, 2022: Sync kubeflow training operator manifests v1.4.0 rc.0 #2119 (Merged)

kimwnasptd mentioned this issue Feb 11, 2022: Update kubeflow/kubeflow manifests from v1.5.0-rc.0 #2123 (Merged)

kimwnasptd mentioned this issue Mar 4, 2022: Update the README #2157 (Merged)

google-oss-prow bot closed this as completed Aug 24, 2023
