predict fails and seldondeployment missing .status #35

Open
DavidLangworthy opened this issue Apr 19, 2019 · 13 comments

@DavidLangworthy

@cliveseldon
Calling predict on a deployment that returned success fails with a connection error. Attempting to debug this reveals that .status is missing from the seldondeployment. Any suggestions for how to debug this?

!kubectl get seldondeployments mnist-classifier -o jsonpath='{.status}'

returns nothing

!kubectl get seldondeployments mnist-classifier -o json
returns
{
  "apiVersion": "machinelearning.seldon.io/v1alpha2",
  "kind": "SeldonDeployment",
  "metadata": {
    "annotations": {
      "kubectl.kubernetes.io/last-applied-configuration": "{\"apiVersion\":\"machinelearning.seldon.io/v1alpha2\",\"kind\":\"SeldonDeployment\",\"metadata\":{\"annotations\":{},\"labels\":{\"app\":\"seldon\"},\"name\":\"mnist-classifier\",\"namespace\":\"kubeflow\"},\"spec\":{\"annotations\":{\"deployment_version\":\"v1\",\"project_name\":\"MNIST Example\",\"seldon.io/engine-separate-pod\":\"false\",\"seldon.io/rest-connection-timeout\":\"100\"},\"name\":\"mnist-classifier\",\"predictors\":[{\"annotations\":{\"predictor_version\":\"v1\"},\"componentSpecs\":[{\"spec\":{\"containers\":[{\"image\":\"seldonio/deepmnistclassifier_runtime:0.2\",\"imagePullPolicy\":\"Always\",\"name\":\"tf-model\",\"volumeMounts\":[{\"mountPath\":\"/data\",\"name\":\"persistent-storage\"}]}],\"terminationGracePeriodSeconds\":1,\"volumes\":[{\"name\":\"persistent-storage\",\"volumeSource\":{\"persistentVolumeClaim\":{\"claimName\":\"nfs-1\"}}}]}}],\"graph\":{\"children\":[],\"endpoint\":{\"type\":\"REST\"},\"name\":\"tf-model\",\"type\":\"MODEL\"},\"name\":\"mnist-classifier\",\"replicas\":1}]}}\n"
    },
    "creationTimestamp": "2019-04-18T21:26:32Z",
    "generation": 1,
    "labels": {
      "app": "seldon"
    },
    "name": "mnist-classifier",
    "namespace": "kubeflow",
    "resourceVersion": "128631",
    "selfLink": "/apis/machinelearning.seldon.io/v1alpha2/namespaces/kubeflow/seldondeployments/mnist-classifier",
    "uid": "a3450e71-6220-11e9-a023-da0ed60f5a55"
  },
  "spec": {
    "annotations": {
      "deployment_version": "v1",
      "project_name": "MNIST Example",
      "seldon.io/engine-separate-pod": "false",
      "seldon.io/rest-connection-timeout": "100"
    },
    "name": "mnist-classifier",
    "predictors": [
      {
        "annotations": {
          "predictor_version": "v1"
        },
        "componentSpecs": [
          {
            "spec": {
              "containers": [
                {
                  "image": "seldonio/deepmnistclassifier_runtime:0.2",
                  "imagePullPolicy": "Always",
                  "name": "tf-model",
                  "volumeMounts": [
                    {
                      "mountPath": "/data",
                      "name": "persistent-storage"
                    }
                  ]
                }
              ],
              "terminationGracePeriodSeconds": 1,
              "volumes": [
                {
                  "name": "persistent-storage",
                  "volumeSource": {
                    "persistentVolumeClaim": {
                      "claimName": "nfs-1"
                    }
                  }
                }
              ]
            }
          }
        ],
        "graph": {
          "children": [],
          "endpoint": {
            "type": "REST"
          },
          "name": "tf-model",
          "type": "MODEL"
        },
        "name": "mnist-classifier",
        "replicas": 1
      }
    ]
  }
}

@ukclivecox
Contributor

Can you check the logs of the cluster-manager and check that its pods are running? There should always be a status, so we need to track this down further.
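
For example, something along these lines, assuming Seldon was installed into the kubeflow namespace as in the resource above (the exact pod name varies per install):

# Find the cluster-manager pod
kubectl get pods -n kubeflow | grep cluster-manager

# Tail its logs and look for errors while it reconciles the SeldonDeployment
kubectl logs -n kubeflow <cluster-manager-pod-name> --tail=100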

@DavidLangworthy
Author

What specifically do I need to look for? Kubeflow starts up so many components that it's hard to find my way around.

@DavidLangworthy
Author

!kubectl get pods -n kubeflow

NAME READY STATUS RESTARTS AGE
ambassador-c9647fb66-fl4zr 1/1 Running 0 1d
ambassador-c9647fb66-g6n9r 1/1 Running 0 1d
ambassador-c9647fb66-z7p27 1/1 Running 0 1d
argo-ui-755fcfc656-s2rgl 1/1 Running 0 1d
centraldashboard-7c948d9df6-jh8zj 1/1 Running 0 1d
jupyter-0 1/1 Running 0 1d
jupyter-web-app-6ffc57d749-mqtgr 0/1 CrashLoopBackOff 318 1d
katib-ui-6dc644d54-jg6mj 1/1 Running 0 1d
kubeflow-r-train-srxtq-1399384440 0/1 Completed 0 23h
kubeflow-sk-train-6llnn-122502152 0/1 Completed 0 23h
kubeflow-tf-train-nc5kg-1269457206 0/1 Completed 0 23h
metacontroller-0 1/1 Running 0 1d
minio-b7595688d-4xhbq 1/1 Running 0 1d
ml-pipeline-59459675dd-npjh6 1/1 Running 0 1d
ml-pipeline-persistenceagent-7f6d4555d7-hdkmn 1/1 Running 1 1d
ml-pipeline-scheduledworkflow-5f4d44fb4f-65xt9 1/1 Running 0 1d
ml-pipeline-ui-f5d595697-z8cl5 1/1 Running 0 1d
ml-pipeline-viewer-controller-deployment-5b4954fb4c-4ldm8 1/1 Running 0 1d
mnist-train-5-worker-0 0/1 Completed 0 23h
mykubeflowapp2-controller-b5677fccf-5fpsm 1/1 Running 0 1d
mysql-5b7578d9f5-8mjld 1/1 Running 0 1d
notebooks-controller-9c5f6b7f5-t2xlh 1/1 Running 0 1d
profiles-7bfcbd5f76-2ht9w 1/1 Running 0 1d
pytorch-operator-847d884f4d-cvwpm 1/1 Running 0 1d
r-train-mfs75 0/1 Completed 0 23h
sk-train-svnwb 0/1 Completed 0 23h
spartakus-volunteer-7787b4cf54-z79tj 1/1 Running 0 1d
studyjob-controller-5995857687-46xrn 1/1 Running 0 1d
tf-job-dashboard-c899cd664-94wtf 1/1 Running 0 1d
tf-job-operator-785546f859-rfzrm 1/1 Running 0 1d
vizier-core-6d56d75f76-969ks 1/1 Running 3 1d
vizier-core-rest-79bdbfbfb8-qnvz9 1/1 Running 0 1d
vizier-db-79d57d5667-f7nst 1/1 Running 0 1d
vizier-suggestion-bayesianoptimization-759f6c56c8-54p6x 1/1 Running 0 1d
vizier-suggestion-grid-59f7f5646d-fqcfg 1/1 Running 0 1d
vizier-suggestion-hyperband-84b8ddc658-xm9fb 1/1 Running 0 1d
vizier-suggestion-random-64b4467f6b-gptpl 1/1 Running 0 1d
workflow-controller-8564bd964f-df7x2 1/1 Running 0 1d

@ukclivecox
Contributor

I don't see the seldon cluster-manager. Did you install seldon as per the docs?
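
A quick way to check, assuming the kubeflow namespace:

# Confirm the SeldonDeployment CRD is registered
kubectl get crd seldondeployments.machinelearning.seldon.io

# Confirm the operator (cluster-manager) deployment exists
kubectl get deployments -n kubeflow | grep seldon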

@DavidLangworthy
Author

Yes, but I gather it was not successful. I will try again.

Thank you

@DavidLangworthy
Author

The deployment worked this time and the cluster manager is up:
dlan@loadclient:~$ kubectl get pods --all-namespaces | grep seldon
kube-system   seldon-spartakus-volunteer-57647c7679-vb6pt          1/1   Running   0   1d
kubeflow      seldon-core-ambassador-6bb6fb974d-qwg79              1/1   Running   0   1m
kubeflow      seldon-core-redis-685dd67c95-grv2h                   1/1   Running   0   1m
kubeflow      seldon-core-seldon-cluster-manager-dd8497ccf-xtm46   1/1   Running   0   1m

However I am still getting an error calling the prediction service.

ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))

The port forward window gives me the following:

dlan@loadclient:~$ kubectl port-forward $(kubectl get pods -n kubeflow -l service=ambassador -o jsonpath='{.items[0].metadata.name}') -n kubeflow 8002:80
Forwarding from 127.0.0.1:8002 -> 80
Forwarding from [::1]:8002 -> 80
Handling connection for 8002
E0419 21:38:55.183309 12957 portforward.go:400] an error occurred forwarding 8002 -> 80: error forwarding port 80 to pod baa7cdd3e0fc3d4ce1d30ff49cd8602421ebce99f6895fdb5aa70e1e362051f9, uid : exit status 1: 2019/04/19 21:38:55 socat[9620] E connect(6, AF=2 127.0.0.1:80, 16): Connection refused
Handling connection for 8002
E0419 21:38:58.731598 12957 portforward.go:400] an error occurred forwarding 8002 -> 80: error forwarding port 80 to pod baa7cdd3e0fc3d4ce1d30ff49cd8602421ebce99f6895fdb5aa70e1e362051f9, uid : exit status 1: 2019/04/19 21:38:58 socat[9798] E connect(6, AF=2 127.0.0.1:80, 16): Connection refused
Handling connection for 8002
E0419 21:39:27.769533 12957 portforward.go:400] an error occurred forwarding 8002 -> 80: error forwarding port 80 to pod baa7cdd3e0fc3d4ce1d30ff49cd8602421ebce99f6895fdb5aa70e1e362051f9, uid : exit status 1: 2019/04/19 21:39:27 socat[10904] E connect(6, AF=2 127.0.0.1:80, 16): Connection refused

@ukclivecox
Contributor

OK. Can you check whether Ambassador exposes port 80, or whether it has moved to 8080 now?
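
For example (names here are placeholders; substitute your own service and pod names):

# See which target port the Ambassador service points at
kubectl describe svc <ambassador-service-name> -n kubeflow | grep -i port

# If the container listens on 8080 rather than 80, forward to that port instead
kubectl port-forward <ambassador-pod-name> -n kubeflow 8002:8080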

@DavidLangworthy
Author

I have two Ambassador services:

ambassador               ClusterIP   10.0.233.236   80/TCP
seldon-core-ambassador   NodePort    10.0.158.182   80:30489/TCP,443:31294/TCP

Thanks for your help.

@ukclivecox
Contributor

I would try connecting to both Ambassadors directly to see which one works, and also check the Ambassador diagnostics.
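
The diagnostics UI is served on Ambassador's admin port (8877 by default), so something like this should reach it (pod name is a placeholder):

# Forward the Ambassador admin port
kubectl port-forward <ambassador-pod-name> -n kubeflow 8877:8877
# then browse to http://localhost:8877/ambassador/v0/diag/ to see the mapped routes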

@DavidLangworthy
Author

DavidLangworthy commented Apr 23, 2019 via email

@DavidLangworthy
Author

I can hit the predictor directly and it works fine. The routes look fine in Ambassador. However, I do not see the requests in the Ambassador logs.
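
For reference, the call through the Ambassador port-forward would look roughly like this (the /seldon/<deployment-name>/ prefix is Seldon's usual Ambassador route convention; the payload shape below is illustrative and depends on the model):

curl -s -X POST http://localhost:8002/seldon/mnist-classifier/api/v0.1/predictions \
  -H 'Content-Type: application/json' \
  -d '{"data": {"ndarray": [[0.0, 0.1, 0.2]]}}'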

Any suggestions?

I'll keep looking around.

@ukclivecox
Contributor

Sorry, missed this. I don't think you'll see requests in the Ambassador logs by default, as Ambassador doesn't log every request. Are the requests working?

@DavidLangworthy
Author

The requests were not working. I've recycled this cluster. I'll bring up a fresh one and see if there is a repro.

Thank you.
