Generate TFJob Python SDK #1103

Merged
merged 1 commit into from
Dec 9, 2019
jinchihe
Member

@jinchihe jinchihe commented Dec 4, 2019

Generate TFJob Python SDK

fixes: #167

The PR includes:

Next:

  • Consider how to merge some of the functions in the current py/kubeflow, especially py/kubeflow/tf_operator/tf_job_client.py.

@coveralls

coveralls commented Dec 4, 2019

Coverage Status

Coverage remained the same at 96.512% when pulling 363a0d4 on jinchihe:generate_tfjob_sdk into 495b328 on kubeflow:master.

@TravisBuddy

Hey @jinchihe,
Your changes look good to me!

View build log

TravisBuddy Request Identifier: d3419b40-170c-11ea-93d3-257117e2c129

@jinchihe
Member Author

jinchihe commented Dec 5, 2019

Why did the cluster setup fail? It seems unrelated to the code change.

INFO:root:Subprocess output:
INFO:root:Error from server (AlreadyExists): clusterrolebindings.rbac.authorization.k8s.io "default-admin" already exists
INFO:root:Creating /mnt/test-data-volume/kubeflow-tf-operator-presubmit-v1-1103-c43c06e-3472-d38f/output/artifacts/junit_setupcluster.xml
Traceback (most recent call last):
 File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
   "__main__", fname, loader, pkg_name)
 File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
   exec code in run_globals
 File "/mnt/test-data-volume/kubeflow-tf-operator-presubmit-v1-1103-c43c06e-3472-d38f/src/kubeflow/tf-operator/py/kubeflow/tf_operator/deploy.py", line 358, in <module>
   main()
 File "/mnt/test-data-volume/kubeflow-tf-operator-presubmit-v1-1103-c43c06e-3472-d38f/src/kubeflow/tf-operator/py/kubeflow/tf_operator/deploy.py", line 354, in main
   args.func(args)
 File "/usr/local/lib/python2.7/dist-packages/retrying.py", line 49, in wrapped_f
   return Retrying(*dargs, **dkw).call(f, *args, **kw)
 File "/usr/local/lib/python2.7/dist-packages/retrying.py", line 212, in call
   raise attempt.get()
 File "/usr/local/lib/python2.7/dist-packages/retrying.py", line 247, in get
   six.reraise(self.value[0], self.value[1], self.value[2])
 File "/usr/local/lib/python2.7/dist-packages/retrying.py", line 200, in call
   attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
 File "/mnt/test-data-volume/kubeflow-tf-operator-presubmit-v1-1103-c43c06e-3472-d38f/src/kubeflow/tf-operator/py/kubeflow/tf_operator/deploy.py", line 161, in setup_cluster
   "--clusterrole=cluster-admin", "--user=" + account
 File "/mnt/test-data-volume/kubeflow-tf-operator-presubmit-v1-1103-c43c06e-3472-d38f/src/kubeflow/testing/py/kubeflow/testing/util.py", line 88, in run
   " ".join(command), process.returncode), "\n".join(output))
subprocess.CalledProcessError: Command 'cmd: kubectl create clusterrolebinding default-admin --clusterrole=cluster-admin --user=kubeflow-[email protected] exited with code 1' returned non-zero exit status 1

@gaocegege
Member

Did you solve it?

@jinchihe
Member Author

jinchihe commented Dec 5, 2019

@gaocegege Yes, I commented out the code that creates the clusterrolebinding. It works fine now.
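
An alternative to commenting the step out would be to make it idempotent, so that a leftover "default-admin" binding from an earlier run is not treated as a failure. The following is only a minimal sketch under stated assumptions, not the code in deploy.py: the helper name `create_clusterrolebinding` and the injectable `run` parameter are hypothetical, and it assumes Python 3.7+ (the CI logs above show Python 2.7).

```python
import subprocess


def create_clusterrolebinding(name, clusterrole, user, run=subprocess.run):
    """Create a ClusterRoleBinding, treating AlreadyExists as success.

    `run` is injectable (hypothetical parameter) so the logic can be
    exercised without a cluster. Returns True if the binding was created
    and False if it already existed.
    """
    command = [
        "kubectl", "create", "clusterrolebinding", name,
        "--clusterrole=" + clusterrole,
        "--user=" + user,
    ]
    result = run(command, capture_output=True, text=True)
    if result.returncode == 0:
        return True
    if "AlreadyExists" in result.stderr:
        # The binding is left over from an earlier run; nothing to do.
        return False
    # Any other failure is still surfaced to the caller.
    raise subprocess.CalledProcessError(
        result.returncode, command, output=result.stdout, stderr=result.stderr)
```

With this shape, the retry wrapper around setup would only see real errors rather than the "already exists" case.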

@jinchihe
Member Author

jinchihe commented Dec 6, 2019

/retest

@jinchihe
Member Author

jinchihe commented Dec 6, 2019

Why is there another problem during cluster setup? :-( It also seems unrelated to the code change :-(

Traceback (most recent call last):
 File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
   "__main__", fname, loader, pkg_name)
 File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
   exec code in run_globals
 File "/mnt/test-data-volume/kubeflow-tf-operator-presubmit-v1-1103-1b41797-7040-a3f4/src/kubeflow/tf-operator/py/kubeflow/tf_operator/deploy.py", line 359, in <module>
   main()
 File "/mnt/test-data-volume/kubeflow-tf-operator-presubmit-v1-1103-1b41797-7040-a3f4/src/kubeflow/tf-operator/py/kubeflow/tf_operator/deploy.py", line 355, in main
   args.func(args)
 File "/usr/local/lib/python2.7/dist-packages/retrying.py", line 49, in wrapped_f
   return Retrying(*dargs, **dkw).call(f, *args, **kw)
 File "/usr/local/lib/python2.7/dist-packages/retrying.py", line 212, in call
   raise attempt.get()
 File "/usr/local/lib/python2.7/dist-packages/retrying.py", line 247, in get
   six.reraise(self.value[0], self.value[1], self.value[2])
 File "/usr/local/lib/python2.7/dist-packages/retrying.py", line 200, in call
   attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
 File "/mnt/test-data-volume/kubeflow-tf-operator-presubmit-v1-1103-1b41797-7040-a3f4/src/kubeflow/tf-operator/py/kubeflow/tf_operator/deploy.py", line 168, in setup_cluster
   util.setup_cluster(api_client)
...
W
 File "/usr/local/lib/python2.7/dist-packages/retrying.py", line 200, in call
   attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
 File "/mnt/test-data-volume/kubeflow-tf-operator-presubmit-v1-1103-1b41797-7040-a3f4/src/kubeflow/tf-operator/py/kubeflow/tf_operator/deploy.py", line 168, in setup_cluster
   util.setup_cluster(api_client)
 File "/mnt/test-data-volume/kubeflow-tf-operator-presubmit-v1-1103-1b41797-7040-a3f4/src/kubeflow/testing/py/kubeflow/testing/util.py", line 692, in setup_cluster
   install_gpu_drivers(api_client)
 File "/mnt/test-data-volume/kubeflow-tf-operator-presubmit-v1-1103-1b41797-7040-a3f4/src/kubeflow/testing/py/kubeflow/testing/util.py", line 638, in install_gpu_drivers
   ext_client.create_namespaced_daemon_set(namespace, daemonset_spec)
 File "/usr/local/lib/python2.7/dist-packages/kubernetes/client/apis/extensions_v1beta1_api.py", line 58, in create_namespaced_daemon_set
   (data) = self.create_namespaced_daemon_set_with_http_info(namespace, body, **kwargs)
 File "/usr/local/lib/python2.7/dist-packages/kubernetes/client/apis/extensions_v1beta1_api.py", line 143, in create_namespaced_daemon_set_with_http_info
   collection_formats=collection_formats)
 File "/usr/local/lib/python2.7/dist-packages/kubernetes/client/api_client.py", line 321, in call_api
   _return_http_data_only, collection_formats, _preload_content, _request_timeout)
 File "/usr/local/lib/python2.7/dist-packages/kubernetes/client/api_client.py", line 155, in __call_api
   _request_timeout=_request_timeout)
 File "/usr/local/lib/python2.7/dist-packages/kubernetes/client/api_client.py", line 364, in request
   body=body)
 File "/usr/local/lib/python2.7/dist-packages/kubernetes/client/rest.py", line 266, in POST
   body=body)
 File "/usr/local/lib/python2.7/dist-packages/kubernetes/client/rest.py", line 222, in request
   raise ApiException(http_resp=r)
kubernetes.client.rest.ApiException: (400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'Date': 'Fri, 06 Dec 2019 01:45:37 GMT', 'Audit-Id': '75dd6c49-0b89-4e95-bf67-2c544b71e848', 'Content-Length': '213', 'Content-Type': 'application/json'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"the API version in the data (apps/v1) does not match the expected API version (extensions/v1beta1)","reason":"BadRequest","code":400}
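
The 400 above says that a DaemonSet manifest declaring apps/v1 was posted through the extensions/v1beta1 client. One way to avoid that kind of mismatch is to pick the client class from the manifest's own apiVersion. This is a rough sketch and not the code in kubeflow/testing: the helper names `daemonset_api_class_name` and `create_daemon_set` are hypothetical, and `ExtensionsV1beta1Api` is only present in older releases of the kubernetes Python client.

```python
# Map a DaemonSet manifest's apiVersion to the kubernetes client class that
# serves that API version. ExtensionsV1beta1Api only exists in older
# releases of the kubernetes Python client.
DAEMONSET_API_CLASSES = {
    "apps/v1": "AppsV1Api",
    "extensions/v1beta1": "ExtensionsV1beta1Api",
}


def daemonset_api_class_name(manifest):
    """Return the client class name matching the manifest's apiVersion."""
    if manifest.get("kind") != "DaemonSet":
        raise ValueError("expected a DaemonSet, got %r" % manifest.get("kind"))
    api_version = manifest.get("apiVersion")
    if api_version not in DAEMONSET_API_CLASSES:
        raise ValueError("unsupported DaemonSet apiVersion: %r" % api_version)
    return DAEMONSET_API_CLASSES[api_version]


def create_daemon_set(api_client, namespace, manifest):
    """Create the DaemonSet through the API group the manifest declares."""
    # Imported lazily so the mapping above is usable without the package.
    from kubernetes import client

    api_cls = getattr(client, daemonset_api_class_name(manifest))
    return api_cls(api_client).create_namespaced_daemon_set(namespace, manifest)
```

The point is only that the API group used for the request and the apiVersion in the body have to agree; updating either side consistently would resolve the error.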

@jinchihe
Member Author

jinchihe commented Dec 6, 2019

The CI tests hang in the setup kubeflow step.

INFO:root:level=error msg="handle object: patching object from cluster: merging object with existing state: configmaps \"jupyterhub-config\" is forbidden: User \"[email protected]\" cannot get resource \"configmaps\" in API group \"\" in the namespace \"kubeflow\""

@gaocegege
Member

/cc @richardsliu @jlewi Do you have any idea about this?

@jinchihe
Member Author

jinchihe commented Dec 6, 2019

@gaocegege The above problem is resolved now; the only remaining failure is a timeout in simple-tfjob-tests-v1. I'm going to take a look.

@jinchihe jinchihe changed the title from "WIP: generate TFJob Python SDK" to "Generate TFJob Python SDK" Dec 7, 2019
@jinchihe
Member Author

jinchihe commented Dec 7, 2019

/cc @gaocegege @richardsliu @jlewi @animeshsingh

The PR is ready for review now. Thanks!

Next, I will consider how to merge some of the useful functions in the current py/kubeflow, especially py/kubeflow/tf_operator/tf_job_client.py, so that users can use them directly once they install the kubeflow-tfjob SDK from pip.
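
For context on what such client helpers wrap, here is a rough sketch of creating a TFJob without the generated SDK, using a plain dict and the generic `CustomObjectsApi` from the kubernetes Python client. It is not the SDK's API: the helper names `build_tfjob` and `submit_tfjob`, the default namespace, and the example image are illustrative assumptions, and the manifest assumes the kubeflow.org/v1 TFJob resource.

```python
def build_tfjob(name, image, workers=1, namespace="default"):
    """Build a minimal TFJob manifest (kubeflow.org/v1) as a plain dict."""
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "TFJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "tfReplicaSpecs": {
                "Worker": {
                    "replicas": workers,
                    "restartPolicy": "Never",
                    "template": {
                        "spec": {
                            # The operator looks for a container named "tensorflow".
                            "containers": [{"name": "tensorflow", "image": image}],
                        }
                    },
                }
            }
        },
    }


def submit_tfjob(tfjob):
    """Submit the manifest as a custom object; needs cluster access."""
    # Imported lazily so build_tfjob works without the kubernetes package.
    from kubernetes import client, config

    config.load_kube_config()
    api = client.CustomObjectsApi()
    return api.create_namespaced_custom_object(
        group="kubeflow.org",
        version="v1",
        namespace=tfjob["metadata"]["namespace"],
        plural="tfjobs",
        body=tfjob,
    )
```

A call such as `submit_tfjob(build_tfjob("mnist", "example.com/mnist:latest", workers=2))` would create the job; typed models from a generated SDK would replace the hand-written dict.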

@gaocegege
Member

@jinchihe Thanks for your awesome contribution!

@animeshsingh

Thanks @jinchihe - great work!

@gaocegege
Member

/lgtm

@johnugeorge
Member

/approve

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnugeorge

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit f3509e6 into kubeflow:master Dec 9, 2019
Successfully merging this pull request may close these issues.

OpenAPI Client Generation for Java, Python