-
Notifications
You must be signed in to change notification settings - Fork 706
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
1322: Modified manifests to use all-in-one training-operator #1346
1322: Modified manifests to use all-in-one training-operator #1346
Conversation
Actions taken: - replaced tf-job-operator => training-operator - replaced kubeflow-tfjobs- => kubeflow-training- - moved crds for mxjobs, tgjobs, pytorchjobs and xgboostjobs from config/crd/bases to manifests/base/ and prefixed them with crd_ Ref: kubeflow#1322 Testing steps: To be added Work in Progress
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed (or fixed any issues), please reply here with What to do if you already signed the CLAIndividual signers
Corporate signers
ℹ️ Googlers: Go here for more info. |
Hi @deepak-muley. Thanks for your PR. I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@googlebot I signed it! |
/ok-to-test |
/cc @Jeffwan /cc @gaocegege |
@johnugeorge: GitHub didn't allow me to request PR reviews from the following users: /cc. Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/hold |
Please help compare https://github.com/kubeflow/tf-operator/blob/c60dabb7daa3ff9ffb4eecfc24ee4a39bce4b7ca/config/manager/manager.yaml#L7-L9 and your deployment.yaml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for doing this @deepak-muley!
@kubeflow/wg-training-leads Since we are using /manifests
folder for our deployment, do we need /config
folder ?
Training operator was found to be working <pre> k -n kubeflow logs -f training-operator-694766989-pp2j4 I0812 21:43:24.739862 1 request.go:645] Throttling request took 1.048945631s, request: GET:https://172.19.0.1:443/apis/networking.k8s.io/v1?timeout=32s 2021-08-12T21:43:25.694Z INFO controller-runtime.metrics metrics server is starting to listen {"addr": ":8080"} 2021-08-12T21:43:25.790Z INFO setup starting manager 2021-08-12T21:43:25.790Z INFO controller-runtime.manager starting metrics server {"path": "/metrics"} 2021-08-12T21:43:25.790Z INFO controller-runtime.manager.controller.tf-operator Starting EventSource {"source": "kind source: /, Kind="} 2021-08-12T21:43:25.790Z INFO controller-runtime.manager.controller.mxnet-operator Starting EventSource {"source": "kind source: /, Kind="} 2021-08-12T21:43:25.791Z INFO controller-runtime.manager.controller.pytorchjob-operator Starting EventSource {"source": "kind source: /, Kind="} 2021-08-12T21:43:25.791Z INFO controller-runtime.manager.controller.xgboostjob-operator Starting EventSource {"source": "kind source: /, Kind="} 2021-08-12T21:43:26.289Z INFO controller-runtime.manager.controller.xgboostjob-operator Starting EventSource {"source": "kind source: /, Kind="} 2021-08-12T21:43:26.294Z INFO controller-runtime.manager.controller.pytorchjob-operator Starting EventSource {"source": "kind source: /, Kind="} 2021-08-12T21:43:26.589Z INFO controller-runtime.manager.controller.mxnet-operator Starting EventSource {"source": "kind source: /, Kind="} 2021-08-12T21:43:26.688Z INFO controller-runtime.manager.controller.tf-operator Starting EventSource {"source": "kind source: /, Kind="} 2021-08-12T21:43:26.889Z INFO controller-runtime.manager.controller.tf-operator Starting EventSource {"source": "kind source: /, Kind="} 2021-08-12T21:43:26.889Z INFO controller-runtime.manager.controller.pytorchjob-operator Starting EventSource {"source": "kind source: /, Kind="} 2021-08-12T21:43:26.890Z INFO controller-runtime.manager.controller.xgboostjob-operator Starting EventSource {"source": "kind source: /, Kind="} 2021-08-12T21:43:26.890Z INFO controller-runtime.manager.controller.mxnet-operator Starting EventSource {"source": "kind source: /, Kind="} 2021-08-12T21:43:26.990Z INFO controller-runtime.manager.controller.xgboostjob-operator Starting Controller 2021-08-12T21:43:26.990Z INFO controller-runtime.manager.controller.tf-operator Starting Controller 2021-08-12T21:43:26.990Z INFO controller-runtime.manager.controller.tf-operator Starting workers {"worker count": 1} 2021-08-12T21:43:26.990Z INFO controller-runtime.manager.controller.pytorchjob-operator Starting Controller 2021-08-12T21:43:26.991Z INFO controller-runtime.manager.controller.xgboostjob-operator Starting workers {"worker count": 1} 2021-08-12T21:43:26.991Z INFO controller-runtime.manager.controller.pytorchjob-operator Starting workers {"worker count": 1} 2021-08-12T21:43:26.991Z INFO controller-runtime.manager.controller.mxnet-operator Starting Controller 2021-08-12T21:43:26.991Z INFO controller-runtime.manager.controller.mxnet-operator Starting workers {"worker count": 1} </pre>
/hold cancel |
manifests/base/deployment.yaml
Outdated
serviceAccountName: tf-job-operator | ||
- command: | ||
- /manager | ||
# disable leader-elect now |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove the lines that are no longer needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Jeffwan should we be removing it or can be handled through another tracker which enables leader election. where do i find details of why it was disabled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be needed for beta release. @deepak-muley please create an issue to track it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.
- now controller-gen generates the crds directly in manifests/base instead of config/crd/bases - updated setup-training-operator.sh to use manifests/overlays/standalone
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: Jeffwan The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Actions taken:
- replaced tf-job-operator => training-operator
- replaced kubeflow-tfjobs- => kubeflow-training-
- moved crds for mxjobs, tgjobs, pytorchjobs and xgboostjobs from
config/crd/bases to manifests/base/ and prefixed them with crd_
- merged config/manager deployment with manifests/base/deployment
Ref: #1322
Testing steps:
docker build -f build/images/training-operator/Dockerfile -t deepakmuley/kubeflow-training-operator:1 .
k get pods -n test-tf-operator
NAME READY STATUS RESTARTS AGE
dist-mnist-for-e2e-test-ps-0 1/1 Running 0 77s
dist-mnist-for-e2e-test-ps-1 1/1 Running 0 77s
dist-mnist-for-e2e-test-worker-0 1/1 Running 0 77s
dist-mnist-for-e2e-test-worker-1 1/1 Running 0 77s
dist-mnist-for-e2e-test-worker-2 1/1 Running 0 77s
dist-mnist-for-e2e-test-worker-3 1/1 Running 0 77s