GPU support with SERVICE_TYPE Model #590

Closed · muma378 opened this issue May 26, 2019 · 10 comments

muma378 commented May 26, 2019

Hi, I was trying to deploy a SeldonDeployment to the cluster that requests a GPU resource and CUDA. I wrote the .yaml as the official docs suggest, but the deployment got stuck at the CRD "parsing" stage, so no Deployment or Service was created. Deploying models that don't require a GPU works fine.

I didn't find any example of using GPUs, so my question is: does Seldon Core support GPUs? Or has anyone succeeded in deploying a model that requires a GPU?

This is the relevant part of my .yaml:

  predictors:
  - annotations:
      predictor_version: v1
    componentSpecs:
    - spec:
        containers:
        - image: xxx
          resources:
            limits:
              alpha.kubernetes.io/nvidia-gpu: 1
          imagePullPolicy: IfNotPresent
          name: xx
          volumeMounts:
          - mountPath: /usr/local/nvidia/bin
            name: bin
          - mountPath: /usr/lib/nvidia
            name: lib
        imagePullSecrets:
        - name: regcred
        terminationGracePeriodSeconds: 1
        volumes:
          - hostPath:
              path: /usr/lib/nvidia-384/bin
            name: bin
          - hostPath:
              path: /usr/lib/nvidia-384
            name: lib
    graph:
      children: []
      endpoint:
        type: GRPC
      name: xx
      type: MODEL
    name: xx
    replicas: 1
@ukclivecox (Contributor)

Can you provide more details on the error at "parsing" stage?
Are you using the "master" branch or an earlier version of Seldon Core?

muma378 (Author) commented May 26, 2019

> Can you provide more details on the error at "parsing" stage?
> Are you using the "master" branch or an earlier version of Seldon Core?

No, the versions of seldon-core and seldon-core-crd are both 0.2.5, installed locally with Helm.

The seldon-core-apiserver reports the error below:

2019-05-26 15:41:08.623  INFO 1 --- [pool-3-thread-1] io.seldon.apife.k8s.DeploymentWatcher    : The time is now 15:41:08
2019-05-26 15:41:08.623  INFO 1 --- [pool-3-thread-1] io.seldon.apife.k8s.DeploymentWatcher    : Watching with rs 3980232
2019-05-26 15:41:08.682  INFO 1 --- [pool-3-thread-1] io.seldon.apife.k8s.DeploymentWatcher    : ADDED
 : {"apiVersion":"machinelearning.seldon.io/v1alpha2","kind":"SeldonDeployment","metadata":{"annotations":{"kubectl.kubernetes.io/last-applied-configuration":"{\"apiVersion\":\"machinelearning.seldon.io/v1alpha2\",\"kind\":\"SeldonDeployment\",\"metadata\":{\"annotations\":{},\"labels\":{\"app\":\"seldon\",\"ksonnet.io/component\":\"facedet\"},\"name\":\"facedet-gpu\",\"namespace\":\"modelzoo\"},\"spec\":{\"annotations\":{\"deployment_version\":\"v1\",\"project_name\":\"facedet\",\"seldon.io/grpc-read-timeout\":\"60000\",\"seldon.io/rest-connection-timeout\":\"60000\",\"seldon.io/rest-read-timeout\":\"60000\"},\"name\":\"facedet-gpu\",\"oauth_key\":\"\",\"oauth_secret\":\"\",\"predictors\":[{\"annotations\":{\"predictor_version\":\"v1\"},\"componentSpecs\":[{\"spec\":{\"containers\":[{\"image\":\"facedet-gpu:v0.1\",\"imagePullPolicy\":\"IfNotPresent\",\"name\":\"facedet\",\"resources\":{\"limits\":{\"alpha.kubernetes.io/nvidia-gpu\":1}},\"volumeMounts\":[{\"mountPath\":\"/usr/local/nvidia/bin\",\"name\":\"bin\"},{\"mountPath\":\"/usr/lib/nvidia\",\"name\":\"lib\"}]}],\"imagePullSecrets\":[{\"name\":\"regcred\"}],\"terminationGracePeriodSeconds\":1,\"volumes\":[{\"hostPath\":{\"path\":\"/usr/lib/nvidia-384/bin\"},\"name\":\"bin\"},{\"hostPath\":{\"path\":\"/usr/lib/nvidia-384\"},\"name\":\"lib\"}]}}],\"graph\":{\"children\":[],\"endpoint\":{\"type\":\"GRPC\"},\"name\":\"facedet\",\"type\":\"MODEL\"},\"name\":\"facedet-gpu\",\"replicas\":1}]}}\n"},"creationTimestamp":"2019-05-26T15:40:22Z","generation":1.0,"labels":{"app":"seldon","ksonnet.io/component":"facedet"},"name":"facedet-gpu","namespace":"modelzoo","resourceVersion":"4317220","selfLink":"/apis/machinelearning.seldon.io/v1alpha2/namespaces/modelzoo/seldondeployments/facedet-gpu","uid":"931f9605-7fcc-11e9-a912-408d5c260149"},"spec":{"annotations":{"deployment_version":"v1","project_name":"facedet","seldon.io/grpc-read-timeout":"60000","seldon.io/rest-connection-timeout":"60000","seldon.io/rest-read-timeout":"60000"},"name":"facedet-gpu","oauth_key":"","oauth_secret":"","predictors":[{"annotations":{"predictor_version":"v1"},"componentSpecs":[{"spec":{"containers":[{"image":"xiaoyang0117/facedet-gpu:v0.1","imagePullPolicy":"IfNotPresent","name":"facedet","resources":{"limits":{"alpha.kubernetes.io/nvidia-gpu":1.0}},"volumeMounts":[{"mountPath":"/usr/local/nvidia/bin","name":"bin"},{"mountPath":"/usr/lib/nvidia","name":"lib"}]}],"imagePullSecrets":[{"name":"regcred"}],"terminationGracePeriodSeconds":1.0,"volumes":[{"hostPath":{"path":"/usr/lib/nvidia-384/bin"},"name":"bin"},{"hostPath":{"path":"/usr/lib/nvidia-384"},"name":"lib"}]}}],"graph":{"children":[],"endpoint":{"type":"GRPC"},"name":"facedet","type":"MODEL"},"name":"facedet-gpu","replicas":1.0}]}}
 2019-05-26 15:41:08.685 ERROR 1 --- [pool-3-thread-1] o.s.s.s.TaskUtils$LoggingErrorHandler    : Unexpected error occurred in scheduled task.
 com.google.protobuf.InvalidProtocolBufferException: Can't decode io.kubernetes.client.proto.resource.Quantity from 1.0
	at io.seldon.apife.pb.QuantityUtils$QuantityParser.merge(QuantityUtils.java:63) ~[classes!/:0.2.5]
	at io.seldon.apife.pb.JsonFormat$ParserImpl.merge(JsonFormat.java:1241) ~[classes!/:0.2.5]
	at io.seldon.apife.pb.JsonFormat$ParserImpl.parseFieldValue(JsonFormat.java:1797) ~[classes!/:0.2.5]
	at io.seldon.apife.pb.JsonFormat$ParserImpl.mergeMapField(JsonFormat.java:1484) ~[classes!/:0.2.5]
	at io.seldon.apife.pb.JsonFormat$ParserImpl.mergeField(JsonFormat.java:1458) ~[classes!/:0.2.5]
	at io.seldon.apife.pb.JsonFormat$ParserImpl.mergeMessage(JsonFormat.java:1294) ~[classes!/:0.2.5]
	at io.seldon.apife.pb.JsonFormat$ParserImpl.merge(JsonFormat.java:1252) ~[classes!/:0.2.5]
	at io.seldon.apife.pb.JsonFormat$ParserImpl.parseFieldValue(JsonFormat.java:1797) ~[classes!/:0.2.5]

@ukclivecox (Contributor)

Are you able to try this with the latest from master?

muma378 (Author) commented May 27, 2019

> Are you able to try this with the latest from master?

Not yet; the latest version is really hard to install because of the Ambassador setup, so I followed the guide in the example, which uses version 0.2.5.
If I understood correctly, do you mean the relevant changes were made in the latest versions?

muma378 (Author) commented May 27, 2019

When I say "blocked in parsing stage", I mean I can saw the deployment name with kubectl get sdep -n namespace but got nothing with kubectl get deploy -n namespace.

@ukclivecox (Contributor)

If you are using 0.2.5, can you check the logs of the cluster-manager?

What problems are you having with Ambassador? In master you would install the official Ambassador Helm chart.

The issue you are having is, I think, due to parsing of Quantity in the protobuf specs. This should be fixed in the version in master, which is why I was hoping you could test with the latest.

muma378 (Author) commented May 27, 2019

Yes, you are correct; I just found a similar scenario in issue #45. I checked the cluster-manager, and the error is indeed about Quantity parsing.
Therefore, I changed the value to the string "1", like this:

    - spec:
        containers:
        - image: my-image-name:v0.1
          resources:
            limits:
              nvidia.com/gpu: "1"

However, the cluster-manager reports another error:

2019-05-27 16:19:27.156 DEBUG 1 --- [pool-1-thread-1] i.s.c.k8s.KubeCRDHandlerImpl             : Updating seldondeployment facedet-gpu with status state: "Failed"
description: "Can\'t find container for predictive unit with name facedet-gpu"
 2019-05-27 16:19:27.307 DEBUG 1 --- [pool-1-thread-1] i.s.c.k8s.SeldonDeploymentWatcher        : MODIFIED

It looks like it can't find the image, but the image is definitely hosted. What else could make this happen?

@ukclivecox (Contributor)

The name in the graph spec must match a container name. It looks like it can't find a container named facedet-gpu.
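
A minimal sketch of that constraint, using the names from the logs above (illustrative, not the exact manifest from this thread): graph.name must equal the name of one of the containers listed under componentSpecs, not the predictor or deployment name.

    componentSpecs:
    - spec:
        containers:
        - image: facedet-gpu:v0.1
          name: facedet          # graph.name below must match this value
    graph:
      children: []
      endpoint:
        type: GRPC
      name: facedet              # matches containers[].name, not "facedet-gpu"
      type: MODEL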

muma378 (Author) commented May 28, 2019

Exactly! Changing the container name resolved my problem, and now I can see a deployment being created. @cliveseldon Thanks very much for your patience!
Finally, back to my original question: technically, Seldon Core is fine with containers that use hardware acceleration, right?

@ukclivecox (Contributor)

There should be no issue, as long as your model image and Pod are correctly set up. We'd love to have an example in this area, so I'm happy to help you get everything working.
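
Putting the two fixes from this thread together, here is a hedged end-to-end sketch of a GPU-requesting SeldonDeployment for 0.2.5 (the image tag is a placeholder, the host-path library mounts from the original manifest are omitted, and nvidia.com/gpu assumes the NVIDIA device plugin is installed; the quantity is quoted so the 0.2.5 Quantity parser accepts it):

apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: facedet-gpu
  namespace: modelzoo
spec:
  name: facedet-gpu
  predictors:
  - name: facedet-gpu
    replicas: 1
    componentSpecs:
    - spec:
        containers:
        - image: facedet-gpu:v0.1          # placeholder image
          imagePullPolicy: IfNotPresent
          name: facedet                    # must match graph.name below
          resources:
            limits:
              nvidia.com/gpu: "1"          # quoted string so Quantity parsing succeeds
        imagePullSecrets:
        - name: regcred
    graph:
      children: []
      endpoint:
        type: GRPC
      name: facedet                        # matches the container name
      type: MODEL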

muma378 closed this as completed May 28, 2019
agrski added a commit that referenced this issue Dec 2, 2022
* add prestop to raw yaml files

* check in helm changes for scaling down

* make default cpu for mlserver / triton to 1

* add scale endpoint to server CR

* changes from HPA PR#83 (from clive)

* update to CRDs

* add default triton cpu request

* remove init containers as we dont use them

* autogenerate server resource

* add agent and rclone cpu / memory requests

* update helm

* remove init container from autogen file

* k6 runner fix

* update scaling logic + tests

* fix draining empty server + adding tests

* when failedscheduling we can have available repls

* reduce cool down timer to 1 minute

* lint

* Add autoscaling docs

* add  terminationGracePeriodSeconds to helm

* add autogen files

* improve agent logging around scaling events

* move scaling logs to debug level

* Update docs/source/contents/kubernetes/autoscaling/index.md

Co-authored-by: Alex Rakowski <[email protected]>

* Update docs/source/contents/kubernetes/autoscaling/index.md

Co-authored-by: Alex Rakowski <[email protected]>

* Update docs/source/contents/kubernetes/autoscaling/index.md

Co-authored-by: Alex Rakowski <[email protected]>

* docs changes

* trim space from helm parameter value

* tidy up comment in reconciler

* update raw yaml files

Co-authored-by: Alex Rakowski <[email protected]>