Training Operator WG and Kubeflow 1.5 release #2105
Comments
The RC release tag v1.4.0-rc.0 has been created: https://github.com/kubeflow/training-operator/tree/v1.4.0-rc.0 @kimwnasptd In the last release, I see that the manifests structure in this repo differs from the one in the training operator repo. How did this happen? Can you sync the RC-tagged manifests from the training operator repo with this repo? https://github.com/kubeflow/training-operator/tree/v1.4.0-rc.0/manifests The MPI operator is now bundled in the training operator, so this folder (https://github.com/kubeflow/manifests/tree/master/apps/mpi-job/upstream) can be deleted. /cc @terrytangyuan |
Thanks for the update @johnugeorge @terrytangyuan!
Hmmm, not sure. It seems that the last PR that updated the Operator's manifests was for copying over the RC2 manifests #2032. Looking at the stable I'll create a PR to update the manifests from Training Operator now, I have an automated script for this. |
Great. Thanks! |
@terrytangyuan @johnugeorge does the training operator v1.4.0-rc.0 include elastic training? kubeflow/community#522 |
Yes |
Elastic PyTorch training is supported through PyTorchJob in the new release, and elastic Horovod training is supported through MPIJob. |
Hi @kubeflow/wg-training-leads, before the manifest testing on Wednesday, Feb 9th, the release team is planning on cutting another RC to use for the testing. Based on a previous communication, the release team will be using Training Operator version v1.4.0-rc.0. If the Training WG has identified any issues since the feature freeze and would like to update the AutoML version before the manifest testing, let us know before Feb. 9th. Thank you! |
After syncing in today's AutoML/Training meeting we will keep on using the Also another note, the @kubeflow/wg-automl-leads will update the kubeflow/katib e2e tests to use the v1.5-branch branch of the manifests. This means that the e2e tests will use the latest training operators, so we'll keep an eye on any issues that arise. |
@kubeflow/wg-training-leads I'm working on finalizing the manifests for the release, as we are getting closer to the release date of March 9th. Regarding the |
@kimwnasptd Yes, we will cut it this week. |
Just saw it's ready. Congrats on the release 🎉 |
Hey, folks. Are there docs changes required as a result of this work? If so, please create an issue and mention it on this docs tracking issue: kubeflow/website#3130 |
/close There has been no activity for a long time. Please reopen if necessary. |
@juliusvonkohout: Closing this issue. In response to this:
@kubeflow/wg-training-leads let's use this tracking issue to coordinate the integration of Training Operator with the Kubeflow 1.5 release.
First off a heads up that the feature freeze phase will start Wednesday (26th January). Before then I'd like to have updated this repo with the manifests of the kubeflow/training-operator repo, in order to be able to cut the first RC tag in this repo.
So what I'd like to ask as a first step before the feature freeze is:
What version of Training Operator would you like to include for the 1.5 release?
Could you provide me with a branch/tag for this version? It doesn't have to be final. The branch/tag provided can keep getting fixes throughout the release process, but not new features.
Are there any open issues/work in progress that you will be working on for your version as the KF release process progresses?
What will the K8s supported versions be for kubeflow/training-operator?