Seldon not creating services for NVIDIA TRT Deployment #826

Closed
damitkwr opened this issue Aug 28, 2019 · 17 comments
damitkwr commented Aug 28, 2019

Hi, here is the YAML to reproduce the issue; feel free to substitute the images with your own NVIDIA TRT deployment images. This exact YAML works on Seldon v0.3.1 but does not work on v0.3.2 and above: only one service gets created. The issue is discussed in detail in: https://seldondev.slack.com/archives/C8Y9A8G0Y/p1567009614010500

apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  labels:
    app: seldon
  name: nvidia-sp
  namespace: seldon
spec:
  name: trt-sp
  predictors:
  - componentSpecs:
    - spec:
        containers:
        - image: gcr.io/[masked]/[masked]/[masked]:0.3
          resources:
            requests:
              cpu: '2'
          name: sp-predictor
        - args:
          - "--model-store=gs://[masked]/search/[masked]"
          - "--http-port=2000"
          - "--grpc-port=2001"
          command:
          - trtserver
          image: nvcr.io/nvidia/tensorrtserver:19.07-py3
          name: inference-server
          ports:
          - name: server
            containerPort: 2000
            targetPort: 2000
            protocol: TCP
          - containerPort: 2001
            protocol: TCP
          - containerPort: 2002
            protocol: TCP
          resources:
            limits:
              nvidia.com/gpu: '1'
            requests:
              cpu: '1'
              nvidia.com/gpu: '1'
          securityContext:
            runAsUser: 1000
        terminationGracePeriodSeconds: 1
        imagePullSecrets:
        - name: ngc
    graph:
      name: sp-predictor
      endpoint:
        type: REST
      type: MODEL
      children: []
      parameters:
      - name: url
        type: STRING
        value: localhost:2001
      - name: model_name
        type: STRING
        value: spelling_model
      - name: protocol
        type: STRING
        value: grpc
    svcOrchSpec:
      resources:
        requests:
          cpu: '1'
      env: []
    name: sp-nvidia
    replicas: 1
@ukclivecox ukclivecox added the bug label Aug 28, 2019
@ukclivecox ukclivecox self-assigned this Aug 28, 2019
@ukclivecox ukclivecox added this to the 1.0.x milestone Aug 28, 2019
@ukclivecox
Contributor

Does it create a Deployment, and do all the containers in that Deployment run?
Can you add the logs from the manager and the output of kubectl describe?
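
One way to collect that information, assuming the SeldonDeployment from the YAML above (nvidia-sp in the seldon namespace); anything not shown in that YAML is illustrative:

# Describe the SeldonDeployment itself
kubectl describe seldondeployments nvidia-sp -n seldon

# List what the operator has created for it so far
kubectl get deploy,svc,pods -n seldon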

@damitkwr
Author

damitkwr commented Aug 29, 2019

It creates a Deployment and all the containers run, including the seldon-engine and Istio proxies. The only difference I can see is that the services are not created correctly. As you mentioned, it should create two services, but I only see one service named seldon-{hashcode}.
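
A quick way to confirm that, using the namespace from the YAML above:

# Only the single seldon-{hashcode} service shows up here
kubectl get svc -n seldon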

Here are the logs from the manager:

https://gist.github.com/damitkwr/a25c9bc0ca5361ccddd662a427fe5ee9

What resources do you want for the kubectl describe command?

@ukclivecox
Contributor

ukclivecox commented Aug 29, 2019 via email

@damitkwr
Author

@cliveseldon I put the logs in a gist now. You should be able to view it.

@ukclivecox
Contributor

sp-predictor-istio:
    Container ID:   docker://49d4fec1a8d30ba939abd63ed8df90221f40f0446f75d0c28af6f3da372a8c71
    Image:          gcr.io/gn-data-science-project02/dev-models/sp-predictor:0.3
    Image ID:       docker-pullable://gcr.io/gn-data-science-project02/dev-models/sp-predictor@sha256:4f70504824cc3d0463d27055f3e065469ec3129709601e6e5490ec67c07d8452
    Port:           9000/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Thu, 29 Aug 2019 17:49:44 +0000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1

Your container failed.

@damitkwr
Author

That was the previous state. If you look at the state now, it says it's running.

@ukclivecox
Contributor

Sorry, yes.

Are you able to get the raw logs using kubectl logs on the manager?

@damitkwr
Author

For some reason, kubectl logs can't find the seldon-operator-controller-manager pod. Would those logs be any different from the ones I gave you from Stackdriver?

@ukclivecox
Contributor

There should be a pod in the seldon-system namespace.
It's hard to read the logs from Stackdriver, and I'm not sure it's capturing all of them.
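
For reference, a minimal way to locate the operator pod and pull its raw logs; the pod and container names below are assumptions for a default install in seldon-system:

kubectl get pods -n seldon-system
kubectl logs -n seldon-system seldon-operator-controller-manager-0 -c manager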

@damitkwr
Author

Never mind, I was in the wrong namespace. Here are the logs:

https://gist.github.com/damitkwr/1d151c6c7786967ad04c1b0f6eeaaaad

@ukclivecox
Contributor

OK, found the issue. A procMount field is being added by Kubernetes as a default, which causes the controller to keep thinking the Deployment needs updating. Will look into fixing this.

@ukclivecox
Contributor

For further background: we need to compare the Deployment we want with the one that is actually running and, if they differ, update it. This is made complicated by Kubernetes adding defaults to some fields.
We need to look at this some more to make the comparison more resilient.
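
As a hedged illustration of the mismatch (the values are examples, not taken from this cluster): the operator submits a container securityContext like

securityContext:
  runAsUser: 1000

but the spec read back from the API server comes back with a default added, e.g.

securityContext:
  runAsUser: 1000
  procMount: Default

so a strict field-by-field comparison of the desired spec against the running Deployment reports a difference on every reconcile, and the controller keeps issuing updates.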

@ukclivecox
Contributor

Thanks for helping to find this issue. Hopefully we can get a fix out tomorrow.

@damitkwr
Author

No worries, it's awesome that you guys can respond this fast!

@ukclivecox
Contributor

I have pushed an update. Can you check with the latest seldonio/seldon-core-operator:0.4.1-SNAPSHOT image?
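
If useful, one hedged way to point an existing install at that image (the statefulset and container names here are assumptions for a default install in seldon-system; adjust for a Helm-managed operator):

kubectl set image -n seldon-system statefulset/seldon-operator-controller-manager manager=seldonio/seldon-core-operator:0.4.1-SNAPSHOT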

@damitkwr
Author

It works now. Awesome! Thank you!
