GPU support with SERVICE_TYPE Model #590

Closed · muma378 opened this issue May 26, 2019 · 10 comments

muma378 commented May 26, 2019

Hi, I was trying to deploy a SeldonDeployment to the cluster that requests a GPU resource and CUDA. I wrote the .yaml as the official docs suggest, but the deployment got stuck at the CRD "parsing" stage, so no Deployment or Service was created. Deploying models that don't require a GPU works fine.

I didn't find any example of using GPUs, so my question is: does Seldon Core support GPUs? Or has anyone succeeded in deploying a model that requires a GPU?

This is the relevant part of my .yaml:

  predictors:
  - annotations:
      predictor_version: v1
    componentSpecs:
    - spec:
        containers:
        - image: xxx
          resources:
            limits:
              alpha.kubernetes.io/nvidia-gpu: 1
          imagePullPolicy: IfNotPresent
          name: xx
          volumeMounts:
          - mountPath: /usr/local/nvidia/bin
            name: bin
          - mountPath: /usr/lib/nvidia
            name: lib
        imagePullSecrets:
        - name: regcred
        terminationGracePeriodSeconds: 1
        volumes:
          - hostPath:
              path: /usr/lib/nvidia-384/bin
            name: bin
          - hostPath:
              path: /usr/lib/nvidia-384
            name: lib
    graph:
      children: []
      endpoint:
        type: GRPC
      name: xx
      type: MODEL
    name: xx
    replicas: 1
@ukclivecox (Contributor)

Can you provide more details on the error at "parsing" stage?
Are you using the "master" branch or an earlier version of Seldon Core?

muma378 (Author) commented May 26, 2019

> Can you provide more details on the error at "parsing" stage?
> Are you using the "master" branch or an earlier version of Seldon Core?

No, the versions of seldon-core and seldon-core-crd are both 0.2.5, installed locally with Helm.

The seldon-core-apiserver reports the error below:

2019-05-26 15:41:08.623  INFO 1 --- [pool-3-thread-1] io.seldon.apife.k8s.DeploymentWatcher    : The time is now 15:41:08
2019-05-26 15:41:08.623  INFO 1 --- [pool-3-thread-1] io.seldon.apife.k8s.DeploymentWatcher    : Watching with rs 3980232
2019-05-26 15:41:08.682  INFO 1 --- [pool-3-thread-1] io.seldon.apife.k8s.DeploymentWatcher    : ADDED
 : {"apiVersion":"machinelearning.seldon.io/v1alpha2","kind":"SeldonDeployment","metadata":{"annotations":{"kubectl.kubernetes.io/last-applied-configuration":"{\"apiVersion\":\"machinelearning.seldon.io/v1alpha2\",\"kind\":\"SeldonDeployment\",\"metadata\":{\"annotations\":{},\"labels\":{\"app\":\"seldon\",\"ksonnet.io/component\":\"facedet\"},\"name\":\"facedet-gpu\",\"namespace\":\"modelzoo\"},\"spec\":{\"annotations\":{\"deployment_version\":\"v1\",\"project_name\":\"facedet\",\"seldon.io/grpc-read-timeout\":\"60000\",\"seldon.io/rest-connection-timeout\":\"60000\",\"seldon.io/rest-read-timeout\":\"60000\"},\"name\":\"facedet-gpu\",\"oauth_key\":\"\",\"oauth_secret\":\"\",\"predictors\":[{\"annotations\":{\"predictor_version\":\"v1\"},\"componentSpecs\":[{\"spec\":{\"containers\":[{\"image\":\"facedet-gpu:v0.1\",\"imagePullPolicy\":\"IfNotPresent\",\"name\":\"facedet\",\"resources\":{\"limits\":{\"alpha.kubernetes.io/nvidia-gpu\":1}},\"volumeMounts\":[{\"mountPath\":\"/usr/local/nvidia/bin\",\"name\":\"bin\"},{\"mountPath\":\"/usr/lib/nvidia\",\"name\":\"lib\"}]}],\"imagePullSecrets\":[{\"name\":\"regcred\"}],\"terminationGracePeriodSeconds\":1,\"volumes\":[{\"hostPath\":{\"path\":\"/usr/lib/nvidia-384/bin\"},\"name\":\"bin\"},{\"hostPath\":{\"path\":\"/usr/lib/nvidia-384\"},\"name\":\"lib\"}]}}],\"graph\":{\"children\":[],\"endpoint\":{\"type\":\"GRPC\"},\"name\":\"facedet\",\"type\":\"MODEL\"},\"name\":\"facedet-gpu\",\"replicas\":1}]}}\n"},"creationTimestamp":"2019-05-26T15:40:22Z","generation":1.0,"labels":{"app":"seldon","ksonnet.io/component":"facedet"},"name":"facedet-gpu","namespace":"modelzoo","resourceVersion":"4317220","selfLink":"/apis/machinelearning.seldon.io/v1alpha2/namespaces/modelzoo/seldondeployments/facedet-gpu","uid":"931f9605-7fcc-11e9-a912-408d5c260149"},"spec":{"annotations":{"deployment_version":"v1","project_name":"facedet","seldon.io/grpc-read-timeout":"60000","seldon.io/rest-connection-timeout":"60000","seldon.io/rest-read-timeout":"60000"},"name":"facedet-gpu","oauth_key":"","oauth_secret":"","predictors":[{"annotations":{"predictor_version":"v1"},"componentSpecs":[{"spec":{"containers":[{"image":"xiaoyang0117/facedet-gpu:v0.1","imagePullPolicy":"IfNotPresent","name":"facedet","resources":{"limits":{"alpha.kubernetes.io/nvidia-gpu":1.0}},"volumeMounts":[{"mountPath":"/usr/local/nvidia/bin","name":"bin"},{"mountPath":"/usr/lib/nvidia","name":"lib"}]}],"imagePullSecrets":[{"name":"regcred"}],"terminationGracePeriodSeconds":1.0,"volumes":[{"hostPath":{"path":"/usr/lib/nvidia-384/bin"},"name":"bin"},{"hostPath":{"path":"/usr/lib/nvidia-384"},"name":"lib"}]}}],"graph":{"children":[],"endpoint":{"type":"GRPC"},"name":"facedet","type":"MODEL"},"name":"facedet-gpu","replicas":1.0}]}}
 2019-05-26 15:41:08.685 ERROR 1 --- [pool-3-thread-1] o.s.s.s.TaskUtils$LoggingErrorHandler    : Unexpected error occurred in scheduled task.
 com.google.protobuf.InvalidProtocolBufferException: Can't decode io.kubernetes.client.proto.resource.Quantity from 1.0
	at io.seldon.apife.pb.QuantityUtils$QuantityParser.merge(QuantityUtils.java:63) ~[classes!/:0.2.5]
	at io.seldon.apife.pb.JsonFormat$ParserImpl.merge(JsonFormat.java:1241) ~[classes!/:0.2.5]
	at io.seldon.apife.pb.JsonFormat$ParserImpl.parseFieldValue(JsonFormat.java:1797) ~[classes!/:0.2.5]
	at io.seldon.apife.pb.JsonFormat$ParserImpl.mergeMapField(JsonFormat.java:1484) ~[classes!/:0.2.5]
	at io.seldon.apife.pb.JsonFormat$ParserImpl.mergeField(JsonFormat.java:1458) ~[classes!/:0.2.5]
	at io.seldon.apife.pb.JsonFormat$ParserImpl.mergeMessage(JsonFormat.java:1294) ~[classes!/:0.2.5]
	at io.seldon.apife.pb.JsonFormat$ParserImpl.merge(JsonFormat.java:1252) ~[classes!/:0.2.5]
	at io.seldon.apife.pb.JsonFormat$ParserImpl.parseFieldValue(JsonFormat.java:1797) ~[classes!/:0.2.5]

@ukclivecox (Contributor)

Are you able to try this with the latest from master?

muma378 (Author) commented May 27, 2019

> Are you able to try this with the latest from master?

Not yet; the latest version is really hard to install because of the Ambassador setup, so I followed the guide in the example, which uses version 0.2.5.
If I understood correctly, do you mean the relevant changes were made in the latest versions?

muma378 (Author) commented May 27, 2019

When I say "blocked in parsing stage", I mean I can saw the deployment name with kubectl get sdep -n namespace but got nothing with kubectl get deploy -n namespace.

@ukclivecox (Contributor)

If you are using 0.2.5, can you check the logs of the cluster-manager?

What problems are you having with Ambassador? In master you would install the official Ambassador Helm chart.

The issue you are having is, I think, due to parsing of Quantity in the protobuf specs. This should be fixed in the version in master, which is why I was hoping you could test with the latest.

muma378 (Author) commented May 27, 2019

Yes, you are correct; I just found a similar scenario in issue #45. I checked the cluster-manager, and the error is indeed about Quantity parsing.
Therefore, I changed the value to the string "1", like this:

    - spec:
        containers:
        - image: my-image-name:v0.1
          resources:
            limits:
              nvidia.com/gpu: "1"

However, the cluster-manager reports another error:

2019-05-27 16:19:27.156 DEBUG 1 --- [pool-1-thread-1] i.s.c.k8s.KubeCRDHandlerImpl             : Updating seldondeployment facedet-gpu with status state: "Failed"
description: "Can\'t find container for predictive unit with name facedet-gpu"
 2019-05-27 16:19:27.307 DEBUG 1 --- [pool-1-thread-1] i.s.c.k8s.SeldonDeploymentWatcher        : MODIFIED

It looks like it can't find the image, but the image is definitely hosted. What else could make this happen?

@ukclivecox (Contributor)

The name in the graph spec must match a container name. It looks like it can't find a container named facedet-gpu.
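
A minimal sketch of that constraint, using the names from the logs above (illustrative, not the exact manifest from this thread): graph.name must equal the name of one of the containers listed under componentSpecs, not the predictor or deployment name.

    componentSpecs:
    - spec:
        containers:
        - image: facedet-gpu:v0.1
          name: facedet          # graph.name below must match this value
    graph:
      children: []
      endpoint:
        type: GRPC
      name: facedet              # matches containers[].name, not "facedet-gpu"
      type: MODEL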

muma378 (Author) commented May 28, 2019

Exactly! Changing the container name resolved my problem, and now I can see a deployment being created. @cliveseldon Thanks very much for your patience!
Finally, back to my original question: technically, Seldon Core is fine with containers that use hardware acceleration, right?

@ukclivecox (Contributor)

There should be no issue, as long as your model image and Pod are correctly set up. We'd love to have an example in this area, so I'm happy to help you get everything working.
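
Putting the two fixes from this thread together, here is a hedged end-to-end sketch of a GPU-requesting SeldonDeployment for 0.2.5 (the image tag is a placeholder, the host-path library mounts from the original manifest are omitted, and nvidia.com/gpu assumes the NVIDIA device plugin is installed; the quantity is quoted so the 0.2.5 Quantity parser accepts it):

apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: facedet-gpu
  namespace: modelzoo
spec:
  name: facedet-gpu
  predictors:
  - name: facedet-gpu
    replicas: 1
    componentSpecs:
    - spec:
        containers:
        - image: facedet-gpu:v0.1          # placeholder image
          imagePullPolicy: IfNotPresent
          name: facedet                    # must match graph.name below
          resources:
            limits:
              nvidia.com/gpu: "1"          # quoted string so Quantity parsing succeeds
        imagePullSecrets:
        - name: regcred
    graph:
      children: []
      endpoint:
        type: GRPC
      name: facedet                        # matches the container name
      type: MODEL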

muma378 closed this as completed May 28, 2019
agrski added a commit that referenced this issue Dec 2, 2022
* add prestop to raw yaml files

* check in helm changes for scaling down

* make default cpu for mlserver / triton to 1

* add scale endpoint to server CR

* changes from HPA PR#83 (from clive)

* update to CRDs

* add default triton cpu request

* remove init containers as we dont use them

* autogenerate server resource

* add agent and rclone cpu / memory requests

* update helm

* remove init container from autogen file

* k6 runner fix

* update scaling logic + tests

* fix draining empty server + adding tests

* when failedscheduling we can have available repls

* reduce cool down timer to 1 minute

* lint

* Add autoscaling docs

* add  terminationGracePeriodSeconds to helm

* add autogen files

* improve agent logging around scaling events

* move scaling logs to debug level

* Update docs/source/contents/kubernetes/autoscaling/index.md

Co-authored-by: Alex Rakowski <[email protected]>

* Update docs/source/contents/kubernetes/autoscaling/index.md

Co-authored-by: Alex Rakowski <[email protected]>

* Update docs/source/contents/kubernetes/autoscaling/index.md

Co-authored-by: Alex Rakowski <[email protected]>

* docs changes

* trim space from helm parameter value

* tidy up comment in reconciler

* update raw yaml files

Co-authored-by: Alex Rakowski <[email protected]>