Seldon not creating services for NVIDIA TRT Deployment #826

Closed
damitkwr opened this issue Aug 28, 2019 · 17 comments
damitkwr commented Aug 28, 2019

Hi, here is the YAML to reproduce the issue; feel free to substitute the images with your own NVIDIA TRT deployment images. This exact YAML works on Seldon v0.3.1 but does not work on v0.3.2 and above: only one service gets created. The issue is discussed in detail in: https://seldondev.slack.com/archives/C8Y9A8G0Y/p1567009614010500

apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  labels:
    app: seldon
  name: nvidia-sp
  namespace: seldon
spec:
  name: trt-sp
  predictors:
  - componentSpecs:
    - spec:
        containers:
        - image: gcr.io/[masked]/[masked]/[masked]:0.3
          resources:
            requests:
              cpu: '2'
          name: sp-predictor
        - args:
          - "--model-store=gs://[masked]/search/[masked]"
          - "--http-port=2000"
          - "--grpc-port=2001"
          command:
          - trtserver
          image: nvcr.io/nvidia/tensorrtserver:19.07-py3
          name: inference-server
          ports:
          - name: server
            containerPort: 2000
            targetPort: 2000
            protocol: TCP
          - containerPort: 2001
            protocol: TCP
          - containerPort: 2002
            protocol: TCP
          resources:
            limits:
              nvidia.com/gpu: '1'
            requests:
              cpu: '1'
              nvidia.com/gpu: '1'
          securityContext:
            runAsUser: 1000
        terminationGracePeriodSeconds: 1
        imagePullSecrets:
        - name: ngc
    graph:
      name: sp-predictor
      endpoint:
        type: REST
      type: MODEL
      children: []
      parameters:
      - name: url
        type: STRING
        value: localhost:2001
      - name: model_name
        type: STRING
        value: spelling_model
      - name: protocol
        type: STRING
        value: grpc
    svcOrchSpec:
      resources:
        requests:
          cpu: '1'
      env: []
    name: sp-nvidia
    replicas: 1
@ukclivecox ukclivecox added the bug label Aug 28, 2019
@ukclivecox ukclivecox self-assigned this Aug 28, 2019
@ukclivecox ukclivecox added this to the 1.0.x milestone Aug 28, 2019
@ukclivecox
Contributor

Does it create a Deployment, and do all the containers in that Deployment run?
Can you add the logs from the manager and the output of kubectl describe?
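
One way to collect that information, assuming the SeldonDeployment from the YAML above (nvidia-sp in the seldon namespace); anything not shown in that YAML is illustrative:

# Describe the SeldonDeployment itself
kubectl describe seldondeployments nvidia-sp -n seldon

# List what the operator has created for it so far
kubectl get deploy,svc,pods -n seldon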

@damitkwr
Author

damitkwr commented Aug 29, 2019

It creates a Deployment and all the containers run, including the seldon-engine and Istio proxies. The only difference I can see is that the services are not created correctly. As you mentioned, it should create two services, but I only see one service named seldon-{hashcode}.
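
A quick way to confirm that, using the namespace from the YAML above:

# Only the single seldon-{hashcode} service shows up here
kubectl get svc -n seldon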

Here are the logs from the manager:

https://gist.github.com/damitkwr/a25c9bc0ca5361ccddd662a427fe5ee9

What resources do you want for the kubectl describe command?

@ukclivecox
Contributor

ukclivecox commented Aug 29, 2019 via email

@damitkwr
Author

@cliveseldon I put the logs in a gist now. You should be able to view it.

@ukclivecox
Contributor

sp-predictor-istio:
    Container ID:   docker://49d4fec1a8d30ba939abd63ed8df90221f40f0446f75d0c28af6f3da372a8c71
    Image:          gcr.io/gn-data-science-project02/dev-models/sp-predictor:0.3
    Image ID:       docker-pullable://gcr.io/gn-data-science-project02/dev-models/sp-predictor@sha256:4f70504824cc3d0463d27055f3e065469ec3129709601e6e5490ec67c07d8452
    Port:           9000/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Thu, 29 Aug 2019 17:49:44 +0000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1

Your container failed.

@damitkwr
Author

That was the previous state. If you look at the state now, it says it's running.

@ukclivecox
Contributor

Sorry, yes.

Are you able to get the raw logs using kubectl logs on the manager?

@damitkwr
Author

For some reason, kubectl logs can't find the seldon-operator-controller-manager pod. Would those logs be any different from the ones I gave you from Stackdriver?

@ukclivecox
Contributor

There should be a pod in the seldon-system namespace.
It's hard to read the logs from Stackdriver, and I'm not sure it's capturing all of them.
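
For reference, a minimal way to locate the operator pod and pull its raw logs; the pod and container names below are assumptions for a default install in seldon-system:

kubectl get pods -n seldon-system
kubectl logs -n seldon-system seldon-operator-controller-manager-0 -c manager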

@damitkwr
Author

Never mind, I was in the wrong namespace. Here are the logs:

https://gist.github.com/damitkwr/1d151c6c7786967ad04c1b0f6eeaaaad

@ukclivecox
Contributor

OK, found the issue. A procMount field is being added by Kubernetes as a default, which causes the controller to keep thinking the Deployment needs updating. Will look into fixing this.

@ukclivecox
Contributor

For further background: we need to compare the Deployment we want with the one that is actually running and, if they differ, update it. This is made complicated by Kubernetes adding defaults to some fields.
We need to look at this some more to make the comparison more resilient.
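
As a hedged illustration of the mismatch (the values are examples, not taken from this cluster): the operator submits a container securityContext like

securityContext:
  runAsUser: 1000

but the spec read back from the API server comes back with a default added, e.g.

securityContext:
  runAsUser: 1000
  procMount: Default

so a strict field-by-field comparison of the desired spec against the running Deployment reports a difference on every reconcile, and the controller keeps issuing updates.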

@ukclivecox
Contributor

Thanks for helping to find this issue. Hopefully we can get a fix out tomorrow.

@damitkwr
Author

No worries, it's awesome that you guys can respond this fast!

@ukclivecox
Contributor

I have pushed an update. Can you check with the latest seldonio/seldon-core-operator:0.4.1-SNAPSHOT image?
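
If useful, one hedged way to point an existing install at that image (the statefulset and container names here are assumptions for a default install in seldon-system; adjust for a Helm-managed operator):

kubectl set image -n seldon-system statefulset/seldon-operator-controller-manager manager=seldonio/seldon-core-operator:0.4.1-SNAPSHOT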

@damitkwr
Author

It works now. Awesome! Thank you!
