
E2E Test for TFServing with GPUs #291

Closed
jlewi opened this issue Feb 24, 2018 · 7 comments · May be fixed by opendatahub-io/kubeflow#326
jlewi (Contributor) commented Feb 24, 2018

We need an E2E test for TF Serving with GPUs.

As part of this we should build it continuously with Prow.

jlewi (Contributor, Author) commented Mar 5, 2018

I have some cycles to work on this.
I'm going to start by adding a ksonnet component to our E2E test to deploy with GPUs.
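A minimal sketch of what that deploy step might look like, assuming the E2E harness drives ksonnet from Python via subprocess. The component and parameter names (tf-serving-gpu, numGpus, modelPath) are placeholders for illustration, not the actual kubeflow prototypes:

```python
import subprocess

def deploy_gpu_serving(app_dir, env="default"):
    """Generate, configure, and apply a hypothetical GPU TF Serving component."""
    def ks(*args):
        subprocess.check_call(["ks"] + list(args), cwd=app_dir)

    # Placeholder prototype/component/parameter names for illustration only.
    ks("generate", "tf-serving", "tf-serving-gpu")
    ks("param", "set", "tf-serving-gpu", "numGpus", "1")
    ks("param", "set", "tf-serving-gpu", "modelPath", "gs://some-bucket/some-model")
    ks("apply", env, "-c", "tf-serving-gpu")
```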

jlewi assigned jlewi and unassigned lluunn Mar 5, 2018
jlewi added a commit to jlewi/kubeflow that referenced this issue Mar 6, 2018
…GPUs.

* This is the first step to creating an E2E test for GPU serving (kubeflow#291).
* This deployment is suitable for testing that we can deploy the GPU container
  and not have it crash because of linking errors.

* This caught a bug in the Dockerfile.

* Fix the Dockerfile for the GPU image; we need to remove the symbolic links
  from /usr/local/nvidia to /usr/local/cuda.

* On GKE the device plugin will make drivers available at /usr/local/nvidia,
  and we don't want this to override /usr/local/cuda.

Related to kubeflow#291
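A rough sanity check along the lines of what this deployment is meant to catch, assuming it runs inside the GPU serving container on a GKE node with the NVIDIA device plugin. It flags the symlink layout the commit removes and tries to load the driver library, which is where linking errors would surface; the paths mirror the commit message and are not taken from the actual test:

```python
import ctypes
import os
import sys

def check_nvidia_layout():
    nvidia = "/usr/local/nvidia"
    # Bad state the commit fixes: /usr/local/nvidia (or entries under it)
    # symlinked into /usr/local/cuda, which the GKE device-plugin mount at
    # /usr/local/nvidia would then shadow.
    candidates = [nvidia]
    if os.path.isdir(nvidia) and not os.path.islink(nvidia):
        candidates += [os.path.join(nvidia, name) for name in os.listdir(nvidia)]
    for path in candidates:
        if os.path.islink(path) and "/usr/local/cuda" in os.readlink(path):
            sys.exit("%s is a symlink into /usr/local/cuda" % path)
    try:
        # libcuda.so.1 comes from the host driver that the device plugin mounts in.
        ctypes.CDLL("libcuda.so.1")
    except OSError as err:
        sys.exit("could not load the NVIDIA driver library: %s" % err)
    print("driver/toolkit layout looks sane")

if __name__ == "__main__":
    check_nvidia_layout()
```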
jlewi (Contributor, Author) commented Mar 7, 2018

Bump to P1 since we want to have GPU serving in our 0.1 release.

@lluunn How can we serve a model on GPUs and verify that GPUs were actually used?

k8s-ci-robot pushed a commit that referenced this issue Mar 7, 2018
…GPUs. (#362)

* This is the first step to creating an E2E test for GPU serving (#291).
* This deployment is suitable for testing that we can deploy the GPU container
  and not have it crash because of linking errors.

* This caught a bug in the Dockerfile.

* Fix the Dockerfile for the GPU image; we need to remove the symbolic links
  from /usr/local/nvidia to /usr/local/cuda.

* On GKE the device plugin will make drivers available at /usr/local/nvidia,
  and we don't want this to override /usr/local/cuda.

Related to #291
jlewi (Contributor, Author) commented Mar 7, 2018

https://stackoverflow.com/questions/42630762/how-to-verify-tensorflow-serving-is-using-gpus-on-a-gpu-instance
suggests looking at nvidia-smi output. Some of these metrics should now be available via Stackdriver.
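One way the E2E test could act on that suggestion, sketched under the assumption that nvidia-smi is reachable inside the serving container (on GKE the device plugin mounts the driver binaries under /usr/local/nvidia/bin) and that the pod name is known; none of the names below come from the actual workflow:

```python
import subprocess

def gpu_utilization(pod, namespace="default"):
    """Return GPU utilization (%) reported by nvidia-smi inside the pod."""
    out = subprocess.check_output([
        "kubectl", "exec", "-n", namespace, pod, "--",
        "/usr/local/nvidia/bin/nvidia-smi",
        "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits",
    ])
    return int(out.decode().strip().splitlines()[0])

def assert_gpu_used(pod, namespace="default"):
    # Call this right after sending prediction requests to the model.
    if gpu_utilization(pod, namespace) == 0:
        raise AssertionError("pod %s reported 0%% GPU utilization" % pod)
```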

jlewi (Contributor, Author) commented Mar 7, 2018

@lluunn This is blocked on #292 and the changes I requested in #383 to pass in a list of parameters to set on the ksonnet component.

Once those are fixed, do you want to pick this up? I think the next step would be adding appropriate steps to our E2E workflow to run the test using GPUs, just like we do with CPUs.
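As one possible extra step on top of the existing CPU checks (a sketch only; the label selector and namespace are placeholders), the GPU run could use the Kubernetes Python client to confirm the serving pod actually requests nvidia.com/gpu and is Running:

```python
from kubernetes import client, config

def assert_pod_requests_gpu(namespace="default", selector="app=tf-serving-gpu"):
    config.load_kube_config()
    pods = client.CoreV1Api().list_namespaced_pod(
        namespace, label_selector=selector).items
    assert pods, "no serving pods matched %s" % selector
    pod = pods[0]
    limits = pod.spec.containers[0].resources.limits or {}
    assert "nvidia.com/gpu" in limits, "serving container does not request a GPU"
    assert pod.status.phase == "Running", (
        "serving pod is not Running: %s" % pod.status.phase)
```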

lluunn (Contributor) commented Mar 9, 2018

I am changing the cluster to kubeflow-ci, which has a GPU pool.
kubeflow/testing#18

jlewi (Contributor, Author) commented Mar 19, 2018

@lluunn Any update on the E2E test?

lluunn (Contributor) commented Mar 19, 2018

WIP #442
