Training WG and Kubeflow 1.4 release #1962

Closed
kimwnasptd opened this issue Aug 11, 2021 · 10 comments

Comments

@kimwnasptd
Member

kimwnasptd commented Aug 11, 2021

@kubeflow/wg-training-leads @johnugeorge @Jeffwan @gaocegege let's use this tracking issue to coordinate the integration of the Training operators with the Kubeflow 1.4 release.

My current understanding is that there is some work in progress for using the updated codebase for the operators, in which all of them will be sharing common code. This work is currently tracked in https://github.com/kubeflow/tf-operator/tree/all-in-one-operator.

Could you also provide an update here on the ETA for this work and the next steps for including it in the manifests? I want to cut the first RC by the end of this week, so I'll need to know which manifests to copy from.

@johnugeorge
Member

Can you cut the RC release after the 15th (early next week)? The only pending work is adding the new manifests, which will be completed very soon, and the testing around them.

@Jeffwan
Member

Jeffwan commented Aug 13, 2021

> This work is currently tracked in https://github.com/kubeflow/tf-operator/tree/all-in-one-operator.

@kimwnasptd
An update on this: we've merged the dev branch code into master in kubeflow/training-operator#1320 and are now preparing the manifests and images, as Johnu said. Please still copy the manifests from tf-operator; they will include everything.

Tracking project: https://github.com/kubeflow/tf-operator/projects/2 (only P0 issues are blocking for the 1.4 release).

@kimwnasptd
Member Author

kimwnasptd commented Aug 15, 2021

> Can you cut RC release after 15th(Early next week) ?

@johnugeorge @Jeffwan yes, we will most probably postpone the start of the feature freeze phase for a couple of days.

> Please still copy manifest from tf-operator and it will includes everything

Since the manifests for all operators will live under kubeflow/tf-operator/manifests/base/, how can someone know the version of a specific operator? In the previous release each operator had its own version (see https://github.com/kubeflow/manifests#kubeflow-components-versions). Is this still the case?

Also, since the manifests for all the operators are in one central place, we should change the folder structure in this repo so that it no longer has a distinct folder for each operator, i.e.:

apps/
  admission-webhook/
  centraldashboard/
  ...
  training-operators/
    <contents from: https://github.com/kubeflow/tf-operator/blob/master/manifests/>
    
common/
docs/
...

@Jeffwan
Member

Jeffwan commented Aug 16, 2021

@kimwnasptd

> how can someone know the version of a specific operator? For the previous release each operator had its own version

The plan is to give users a single universal operator that supports all frameworks. That means we won't have separate PyTorch, MXNet, and XGBoost operators in the 1.4 release.

> Is this still the case?

No, we just need one now.

> training-operators/
>     <contents from: https://github.com/kubeflow/tf-operator/blob/master/manifests/>

That's what we planned.

@johnugeorge
Member

johnugeorge commented Aug 18, 2021

related #1976

@kimwnasptd
Member Author

Small heads-up: we are now in the Feature Freeze phase and are updating the docs on kubeflow.org.
https://github.com/kubeflow/manifests/tree/master/docs/releases/release-1.4#timeline

If there are any docs updates you'd like to work on for the KF 1.4 release, please add a comment in kubeflow/website#2879 so we can track them.

@zijianjoy
Contributor

Created an issue for training-operator in #2018.

@johnugeorge
Member

A new RC has been cut, which also includes the fix for #2018:

https://github.com/kubeflow/tf-operator/releases/tag/v1.3.0-rc.1

@kimwnasptd
Member Author

Closing this, since KF 1.4 has been released.

/close

@google-oss-prow

@kimwnasptd: Closing this issue.
