Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

1322: Modified manifests to use all-in-one training-operator #1346

Merged

Conversation

deepak-muley
Copy link
Contributor

@deepak-muley deepak-muley commented Aug 12, 2021

Actions taken:
- replaced tf-job-operator => training-operator
- replaced kubeflow-tfjobs- => kubeflow-training-
- moved crds for mxjobs, tgjobs, pytorchjobs and xgboostjobs from
config/crd/bases to manifests/base/ and prefixed them with crd_
- merged config/manager deployment with manifests/base/deployment
Ref: #1322
Testing steps:

  • Deleted exsting operator deployments for tfjob, pytorch etc
  • Built the training-operator as follows
    docker build -f build/images/training-operator/Dockerfile -t deepakmuley/kubeflow-training-operator:1 .
  • updated the manifests to use this new image
  • deployed the training-operator
  • Build and ran tfjob as per developer guide
    k get pods -n test-tf-operator
    NAME READY STATUS RESTARTS AGE
    dist-mnist-for-e2e-test-ps-0 1/1 Running 0 77s
    dist-mnist-for-e2e-test-ps-1 1/1 Running 0 77s
    dist-mnist-for-e2e-test-worker-0 1/1 Running 0 77s
    dist-mnist-for-e2e-test-worker-1 1/1 Running 0 77s
    dist-mnist-for-e2e-test-worker-2 1/1 Running 0 77s
    dist-mnist-for-e2e-test-worker-3 1/1 Running 0 77s

Actions taken:
    - replaced tf-job-operator => training-operator
    - replaced kubeflow-tfjobs- => kubeflow-training-
    - moved crds for mxjobs, tgjobs, pytorchjobs and xgboostjobs from
      config/crd/bases to manifests/base/ and prefixed them with crd_
Ref: kubeflow#1322
Testing steps: To be added
Work in Progress
@google-cla
Copy link

google-cla bot commented Aug 12, 2021

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

ℹ️ Googlers: Go here for more info.

@aws-kf-ci-bot
Copy link
Contributor

Hi @deepak-muley. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@deepak-muley
Copy link
Contributor Author

@googlebot I signed it!

@johnugeorge
Copy link
Member

/ok-to-test

@johnugeorge
Copy link
Member

/cc @Jeffwan /cc @gaocegege

@google-oss-robot
Copy link

@johnugeorge: GitHub didn't allow me to request PR reviews from the following users: /cc.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @Jeffwan /cc @gaocegege

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@deepak-muley
Copy link
Contributor Author

/hold

@Jeffwan
Copy link
Member

Jeffwan commented Aug 12, 2021

Copy link
Member

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for doing this @deepak-muley!
@kubeflow/wg-training-leads Since we are using /manifests folder for our deployment, do we need /config folder ?

manifests/base/cluster-role.yaml Outdated Show resolved Hide resolved
manifests/base/kustomization.yaml Outdated Show resolved Hide resolved
manifests/overlays/kubeflow/kustomization.yaml Outdated Show resolved Hide resolved
manifests/overlays/standalone/kustomization.yaml Outdated Show resolved Hide resolved
Training operator was found to be working
<pre>
k -n kubeflow logs -f training-operator-694766989-pp2j4
I0812 21:43:24.739862       1 request.go:645] Throttling request took 1.048945631s, request: GET:https://172.19.0.1:443/apis/networking.k8s.io/v1?timeout=32s
2021-08-12T21:43:25.694Z	INFO	controller-runtime.metrics	metrics server is starting to listen	{"addr": ":8080"}
2021-08-12T21:43:25.790Z	INFO	setup	starting manager
2021-08-12T21:43:25.790Z	INFO	controller-runtime.manager	starting metrics server	{"path": "/metrics"}
2021-08-12T21:43:25.790Z	INFO	controller-runtime.manager.controller.tf-operator	Starting EventSource	{"source": "kind source: /, Kind="}
2021-08-12T21:43:25.790Z	INFO	controller-runtime.manager.controller.mxnet-operator	Starting EventSource	{"source": "kind source: /, Kind="}
2021-08-12T21:43:25.791Z	INFO	controller-runtime.manager.controller.pytorchjob-operator	Starting EventSource	{"source": "kind source: /, Kind="}
2021-08-12T21:43:25.791Z	INFO	controller-runtime.manager.controller.xgboostjob-operator	Starting EventSource	{"source": "kind source: /, Kind="}
2021-08-12T21:43:26.289Z	INFO	controller-runtime.manager.controller.xgboostjob-operator	Starting EventSource	{"source": "kind source: /, Kind="}
2021-08-12T21:43:26.294Z	INFO	controller-runtime.manager.controller.pytorchjob-operator	Starting EventSource	{"source": "kind source: /, Kind="}
2021-08-12T21:43:26.589Z	INFO	controller-runtime.manager.controller.mxnet-operator	Starting EventSource	{"source": "kind source: /, Kind="}
2021-08-12T21:43:26.688Z	INFO	controller-runtime.manager.controller.tf-operator	Starting EventSource	{"source": "kind source: /, Kind="}
2021-08-12T21:43:26.889Z	INFO	controller-runtime.manager.controller.tf-operator	Starting EventSource	{"source": "kind source: /, Kind="}
2021-08-12T21:43:26.889Z	INFO	controller-runtime.manager.controller.pytorchjob-operator	Starting EventSource	{"source": "kind source: /, Kind="}
2021-08-12T21:43:26.890Z	INFO	controller-runtime.manager.controller.xgboostjob-operator	Starting EventSource	{"source": "kind source: /, Kind="}
2021-08-12T21:43:26.890Z	INFO	controller-runtime.manager.controller.mxnet-operator	Starting EventSource	{"source": "kind source: /, Kind="}
2021-08-12T21:43:26.990Z	INFO	controller-runtime.manager.controller.xgboostjob-operator	Starting Controller
2021-08-12T21:43:26.990Z	INFO	controller-runtime.manager.controller.tf-operator	Starting Controller
2021-08-12T21:43:26.990Z	INFO	controller-runtime.manager.controller.tf-operator	Starting workers	{"worker count": 1}
2021-08-12T21:43:26.990Z	INFO	controller-runtime.manager.controller.pytorchjob-operator	Starting Controller
2021-08-12T21:43:26.991Z	INFO	controller-runtime.manager.controller.xgboostjob-operator	Starting workers	{"worker count": 1}
2021-08-12T21:43:26.991Z	INFO	controller-runtime.manager.controller.pytorchjob-operator	Starting workers	{"worker count": 1}
2021-08-12T21:43:26.991Z	INFO	controller-runtime.manager.controller.mxnet-operator	Starting Controller
2021-08-12T21:43:26.991Z	INFO	controller-runtime.manager.controller.mxnet-operator	Starting workers	{"worker count": 1}
</pre>
@deepak-muley
Copy link
Contributor Author

/hold cancel

@deepak-muley deepak-muley changed the title 1322: Modified manifests to use all-in-one training-operator WIP 1322: Modified manifests to use all-in-one training-operator Aug 12, 2021
serviceAccountName: tf-job-operator
- command:
- /manager
# disable leader-elect now
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the lines that are no longer needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Jeffwan should we be removing it or can be handled through another tracker which enables leader election. where do i find details of why it was disabled?

Copy link
Member

@Jeffwan Jeffwan Aug 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be needed for beta release. @deepak-muley please create an issue to track it.

Copy link
Member

@terrytangyuan terrytangyuan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.

- now controller-gen generates the crds directly in manifests/base
  instead of config/crd/bases
- updated setup-training-operator.sh to use manifests/overlays/standalone
@Jeffwan
Copy link
Member

Jeffwan commented Aug 13, 2021

/lgtm
/approve

@google-oss-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Jeffwan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants