Failed set up cluster while installing GPU Drivers #537

Closed
jinchihe opened this issue Dec 6, 2019 · 4 comments · Fixed by #538

jinchihe commented Dec 6, 2019

In the TFJob CI tests, cluster setup fails while installing the GPU drivers, as shown below.

INFO:root:Creating namespace kubeflow
INFO:root:Namespace kubeflow already exists.
INFO:root:GPUs detected in cluster.
INFO:root:Install GPU Drivers.
INFO:root:Using daemonset file: https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/stable/nvidia-driver-installer/cos/daemonset-preloaded.yaml
INFO:root:Creating /mnt/test-data-volume/kubeflow-tf-operator-presubmit-v1-1103-00ff6f2-6420-8186/output/artifacts/junit_setupcluster.xml
Traceback (most recent call last):
 File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
   "__main__", fname, loader, pkg_name)
 File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
   exec code in run_globals
 File "/mnt/test-data-volume/kubeflow-tf-operator-presubmit-v1-1103-00ff6f2-6420-8186/src/kubeflow/tf-operator/py/kubeflow/tf_operator/deploy.py", line 359, in <module>
   main()
 File "/mnt/test-data-volume/kubeflow-tf-operator-presubmit-v1-1103-00ff6f2-6420-8186/src/kubeflow/tf-operator/py/kubeflow/tf_operator/deploy.py", line 355, in main
   args.func(args)
 File "/usr/local/lib/python2.7/dist-packages/retrying.py", line 49, in wrapped_f
   return Retrying(*dargs, **dkw).call(f, *args, **kw)
 File "/usr/local/lib/python2.7/dist-packages/retrying.py", line 212, in call
   raise attempt.get()
 File "/usr/local/lib/python2.7/dist-packages/retrying.py", line 247, in get
   six.reraise(self.value[0], self.value[1], self.value[2])
 File "/usr/local/lib/python2.7/dist-packages/retrying.py", line 200, in call
   attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
 File "/mnt/test-data-volume/kubeflow-tf-operator-presubmit-v1-1103-00ff6f2-6420-8186/src/kubeflow/tf-operator/py/kubeflow/tf_operator/deploy.py", line 168, in setup_cluster
   util.setup_cluster(api_client)
 File "/mnt/test-data-volume/kubeflow-tf-operator-presubmit-v1-1103-00ff6f2-6420-8186/src/kubeflow/testing/py/kubeflow/testing/util.py", line 692, in setup_cluster
   install_gpu_drivers(api_client)
 File "/mnt/test-data-volume/kubeflow-tf-operator-presubmit-v1-1103-00ff6f2-6420-8186/src/kubeflow/testing/py/kubeflow/testing/util.py", line 638, in install_gpu_drivers
   ext_client.create_namespaced_daemon_set(namespace, daemonset_spec)
 File "/usr/local/lib/python2.7/dist-packages/kubernetes/client/apis/extensions_v1beta1_api.py", line 58, in create_namespaced_daemon_set
   (data) = self.create_namespaced_daemon_set_with_http_info(namespace, body, **kwargs)
 File "/usr/local/lib/python2.7/dist-packages/kubernetes/client/apis/extensions_v1beta1_api.py", line 143, in create_namespaced_daemon_set_with_http_info
   collection_formats=collection_formats)
 File "/usr/local/lib/python2.7/dist-packages/kubernetes/client/api_client.py", line 321, in call_api
   _return_http_data_only, collection_formats, _preload_content, _request_timeout)
 File "/usr/local/lib/python2.7/dist-packages/kubernetes/client/api_client.py", line 155, in __call_api
   _request_timeout=_request_timeout)
 File "/usr/local/lib/python2.7/dist-packages/kubernetes/client/api_client.py", line 364, in request
   body=body)
 File "/usr/local/lib/python2.7/dist-packages/kubernetes/client/rest.py", line 266, in POST
   body=body)
 File "/usr/local/lib/python2.7/dist-packages/kubernetes/client/rest.py", line 222, in request
   raise ApiException(http_resp=r)
kubernetes.client.rest.ApiException: (400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'Date': 'Fri, 06 Dec 2019 02:46:33 GMT', 'Audit-Id': '5036d786-8d57-4d82-84b2-d1457d3c9c4a', 'Content-Length': '213', 'Content-Type': 'application/json'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"the API version in the data (apps/v1) does not match the expected API version (extensions/v1beta1)","reason":"BadRequest","code":400}
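
For anyone triaging similar failures, here is a minimal sketch of pulling the useful part out of that response body with only the standard library. The helper names are hypothetical, and the regular expression assumes the message wording shown in the log above, which may differ between Kubernetes versions.

```python
import json
import re

# Response body copied verbatim from the traceback above.
BODY = (
    '{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure",'
    '"message":"the API version in the data (apps/v1) does not match the '
    'expected API version (extensions/v1beta1)","reason":"BadRequest","code":400}'
)


def summarize_status(body):
    """Return (code, reason, message) from a Kubernetes Status JSON body."""
    status = json.loads(body)
    return status.get("code"), status.get("reason"), status.get("message")


def versions_in_message(message):
    """Extract the parenthesized API versions (assumes the wording seen above)."""
    return re.findall(r"\(([^)]+)\)", message)


code, reason, message = summarize_status(BODY)
sent, expected = versions_in_message(message)
print("HTTP {}: manifest declares {}, endpoint expects {}".format(code, sent, expected))
```

This makes the mismatch explicit: the manifest declares one API version while the client posted it to an endpoint for another.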

jinchihe commented Dec 6, 2019

I think this is caused by the DaemonSet manifest being inconsistent with the cluster. There are two ways to fix the problem:

  1. Specify or upgrade the cluster version? I am not sure about this, because I think 1.13 already supports apps/v1 as the API version for DaemonSet. Or is it caused by a k8s client version issue?
  2. Roll back the DaemonSet manifest to the one below:
    https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/k8s-1.8/device-plugin-daemonset.yaml


jinchihe commented Dec 6, 2019

/cc @jlewi Any ideas on this?

Related to PR kubeflow/training-operator#1103


jinchihe commented Dec 6, 2019

I think the root cause may be here

ext_client = k8s_client.ExtensionsV1beta1Api(api_client)

We should use kubernetes.client.AppsV1Api instead of kubernetes.client.ExtensionsV1beta1Api, since the DaemonSet API version is apps/v1 in the file below.
https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/k8s-1.8/device-plugin-daemonset.yaml
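
A minimal sketch of that change, assuming the manifest has already been parsed into a dict. `api_class_name_for` and `create_daemon_set` are hypothetical helper names, not functions from kubeflow/testing; the client classes and the `create_namespaced_daemon_set(namespace, body)` call are the ones that appear in the traceback and in the suggestion above. This is not necessarily how the actual fix was written.

```python
# Maps a DaemonSet manifest's apiVersion to the kubernetes.client class name
# that serves that API group/version.
API_CLASS_BY_VERSION = {
    "apps/v1": "AppsV1Api",
    "extensions/v1beta1": "ExtensionsV1beta1Api",
}


def api_class_name_for(manifest):
    """Return the client class name matching the manifest's apiVersion."""
    api_version = manifest.get("apiVersion")
    if api_version not in API_CLASS_BY_VERSION:
        raise ValueError("Unsupported DaemonSet apiVersion: {!r}".format(api_version))
    return API_CLASS_BY_VERSION[api_version]


def create_daemon_set(api_client, namespace, manifest):
    """Create the DaemonSet through the API class that matches the manifest."""
    # Imported lazily so the mapping above is usable without the client installed.
    from kubernetes import client as k8s_client

    # Raises AttributeError if the installed client does not ship this class.
    api_cls = getattr(k8s_client, api_class_name_for(manifest))
    return api_cls(api_client).create_namespaced_daemon_set(namespace, manifest)
```

Choosing the class from the manifest, rather than hard-coding it, means a future upstream change of `apiVersion` fails with a clear `ValueError` instead of a 400 from the API server.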


jinchihe commented Dec 6, 2019

Root cause: device-plugin-daemonset.yaml was updated 10 days ago from extensions/v1beta1 to apps/v1.
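
Since the manifest is fetched from an upstream branch that can change, a quick check of the declared version before creating the DaemonSet may help catch this earlier. The following is a rough sketch only: `declared_api_version` is a hypothetical helper, `SAMPLE` is an illustrative manifest fragment (not the real file contents), and the regex only looks at the first top-level `apiVersion:` line rather than parsing YAML properly.

```python
import re


def declared_api_version(manifest_text):
    """Return the first top-level apiVersion in the manifest text, or None."""
    match = re.search(r"^apiVersion:\s*(\S+)", manifest_text, re.MULTILINE)
    if match is None:
        return None
    # Drop optional surrounding quotes, e.g. apiVersion: "apps/v1".
    return match.group(1).strip("'\"")


# Illustrative fragment only; the real upstream manifest is much longer.
SAMPLE = """\
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: example-driver-installer
"""

print(declared_api_version(SAMPLE))
```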
