Issue in deployments of multiple models #103

Closed
krishna-dahifale opened this issue Feb 27, 2018 · 13 comments

@krishna-dahifale

Hi,
I have seldon-core up and running on a Kubernetes cluster and can get the sample models (FX Market Prediction and sklearn_iris) running and check their responses with the Postman tool.
However, if I deploy both models on seldon-core one after another, I get the desired output only from the most recently deployed model. So my question is whether we can deploy multiple models on seldon-core, or only one model at a time. Also, we would like to know how the seldon API server figures out which model to access: on the basis of the URL, or something else?

@Maximophone
Contributor

Maximophone commented Feb 27, 2018

Hi Krishna,

How are you deploying your second graph? In the json, you need a unique oauth key for each graph. Then when you get a token for the API, you specify the key and secret corresponding to the graph you are planning to query.
I suspect you are using the same oauth key and secret for both deployments (as I think in the examples we always use the same key and secret) thus overwriting the first one.

@cliveseldon Definitely another thing we should clarify in the docs.
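
To illustrate the point about unique credentials, here is a minimal Python sketch that builds two deployment descriptions and reports any OAuth key used by more than one of them. The field names (`oauth_key`, `oauth_secret`, `spec.name`) are taken from the `kubectl describe` output later in this thread; the second deployment's name and credentials are hypothetical, and the dictionaries are an illustrative subset rather than a complete manifest.

```python
from collections import defaultdict


def make_deployment(name, oauth_key, oauth_secret):
    """Build an illustrative subset of a SeldonDeployment manifest."""
    return {
        "apiVersion": "machinelearning.seldon.io/v1alpha1",
        "kind": "SeldonDeployment",
        "metadata": {"name": name},
        "spec": {"name": name, "oauth_key": oauth_key, "oauth_secret": oauth_secret},
    }


def find_shared_oauth_keys(deployments):
    """Return {oauth_key: [deployment names]} for keys used more than once."""
    users = defaultdict(list)
    for dep in deployments:
        users[dep["spec"]["oauth_key"]].append(dep["metadata"]["name"])
    return {key: names for key, names in users.items() if len(names) > 1}


iris = make_deployment("sklearn-iris-deployment", "oauth-key", "oauth-secret")
# Hypothetical second deployment that reuses the example credentials.
fx_clash = make_deployment("fx-market-deployment", "oauth-key", "oauth-secret")
# Same deployment with its own (hypothetical) key and secret.
fx_fixed = make_deployment("fx-market-deployment", "fx-oauth-key", "fx-oauth-secret")

print(find_shared_oauth_keys([iris, fx_clash]))
print(find_shared_oauth_keys([iris, fx_fixed]))
```

Running a check like this over the JSON files before `kubectl apply` would catch the case suspected above, where the second deployment reuses the first one's key.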

@ukclivecox
Contributor

Adding to the above, just to confirm: you need to give each deployment a separate name, plus a separate oauth key and secret if you are using the built-in API Frontend.
If you are using Ambassador, your deployments will have different endpoints on the Ambassador reverse proxy. For a REST deployment this will be like: /seldon/. See the notebooks for examples.
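
With the built-in API Frontend, the key and secret sent when requesting a token are what select the graph. The sketch below only builds such a request, without sending it, to show that different credentials produce different requests. The base URL, the `/oauth/token` path and the `grant_type=client_credentials` body are assumptions modeled on a standard OAuth client-credentials flow, not something confirmed in this thread; check the notebooks for the exact calls.

```python
import base64
import urllib.parse
import urllib.request


def build_token_request(base_url, oauth_key, oauth_secret):
    """Build (but do not send) a client-credentials token request."""
    credentials = f"{oauth_key}:{oauth_secret}".encode("utf-8")
    basic = base64.b64encode(credentials).decode("ascii")
    body = urllib.parse.urlencode({"grant_type": "client_credentials"}).encode("ascii")
    return urllib.request.Request(
        url=f"{base_url}/oauth/token",  # assumed path
        data=body,
        headers={"Authorization": f"Basic {basic}"},
        method="POST",
    )


# Assumed local address of the API frontend; the second key/secret pair is hypothetical.
iris_req = build_token_request("http://localhost:8080", "oauth-key", "oauth-secret")
fx_req = build_token_request("http://localhost:8080", "fx-oauth-key", "fx-oauth-secret")

# Each graph gets its own Authorization header, so each token maps to one graph.
print(iris_req.get_header("Authorization") != fx_req.get_header("Authorization"))
```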

@zhangwei730

@krishna-dahifale Hello!
When you said you can run seldon-core and the examples on Kubernetes, what kind of Kubernetes do you mean: a minikube cluster or a full Kubernetes cluster? Although I can run the examples and seldon on minikube successfully, they fail on my Kubernetes cluster (the expected pod, i.e. sklearn--irisclassifier-, never appears). I wonder why.

@ukclivecox
Contributor

@zhangwei730 Can you give details of the kubernetes cluster you are using? GCP, bare-metal?

Have you followed the GCP examples in notebooks, e.g. https://github.com/SeldonIO/seldon-core/blob/master/notebooks/kubectl_demo_gcp.ipynb

@zhangwei730

@cliveseldon Thank you for your reply.
My Kubernetes version is 1.8, with all the other necessary tools installed (etcd, kubectl, etc.). It was installed with one master node and one minion node (both on virtual CentOS 7 systems).

Since you mentioned GCP, I noticed that I haven't installed anything GCP-related.
So do I need to install Kubernetes on a GCP cluster?

thank you!

@ukclivecox
Contributor

@zhangwei730 No, seldon-core runs on any Kubernetes cluster.

Can you explain the steps you ran and the particular issue you are experiencing in detail?

Thanks

@zhangwei730

@cliveseldon Gladly, yes!
The steps are as follows:

  1. I successfully installed Kubernetes on the two virtual CentOS 7 nodes, with docker, etcd, flanneld, kube-apiserver, kube-controller-manager, kube-scheduler, kubelet and kube-proxy all showing 'active (running)'. However, etcd on the node reported:
    -- server is likely overloaded
    -- failed to send out heartbeat on time (exceede...s)
    -- server is likely overloaded
    -- the clock difference against peer 7acd51f1bf7...s

  2. After installing Kubernetes, I ran an example to test whether it was installed correctly, and it went as hoped.

  3. Then I installed helm as described in the guide, https://github.com/kubernetes/helm/blob/master/docs/install.md. When I ran 'helm init' I got an error saying 'uid : unable to do port forwarding: socat not found'. I ran 'yum install socat' on both nodes and the error went away. After that, however, 'helm list' gave another error: "system:serviceaccount:kube-system:default" cannot get namespaces in the namespace "default". I googled it and ran "kubectl create clusterrolebinding add-on-cluster-admin --clusterrole=cluster-admin --serviceaccount=kube-system:default", and it worked.
    Then I went on with "helm install", model generation, creating the image and pushing the image (seldon/irisclassifier) to Docker Hub, and they all went fine (tiller and the seldon-core pods were installed). All was well until the next step.

  4. When I ran "kubectl apply -f sklearn_iris_deployment.json", the prompt said
    seldondeployment "seldon-deployment-example" created,
    however, when I ran 'kubectl get pods', there was no sklearn--irisclassifier- related pod at all! I repeated this quite a few times and waited quite a while, and the result was the same. The output of kubectl get pods --all-namespaces is:

NAMESPACE     NAME                                      READY   STATUS    RESTARTS   AGE
default       redis-6668b544f4-qxmhc                    1/1     Running   0          14m
default       seldon-apiserver-587f6fbc4f-ncnzg         1/1     Running   0          14m
default       seldon-cluster-manager-67cc67995b-rlhmj   1/1     Running   0          14m
kube-system   heapster-55c5d9c56b-st2hb                 1/1     Running   1          20h
kube-system   kube-dns-778977457c-l2sq4                 3/3     Running   3          21h
kube-system   kubernetes-dashboard-7c5d596d8c-l7vc7     1/1     Running   1          20h
kube-system   monitoring-grafana-5bccc9f786-nl298       1/1     Running   1          20h
kube-system   monitoring-influxdb-85cb4985d4-nlb92      1/1     Running   1          20h
kube-system   tiller-deploy-f44659b6c-g5g2v             1/1     Running   0          16m

  5. Finally, I compared the output of the command
    kubectl describe seldondeployments seldon-deployment-example
    with the output from minikube, and found a few differences between the two:

The one on minikube:

Name: seldon-deployment-example
Namespace: default
Labels: app=seldon
Annotations: kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"machinelearning.seldon.io/v1alpha1","kind":"SeldonDeployment","metadata":{"name":"seldon-deployment-example","namespace":"default","self...
API Version: machinelearning.seldon.io/v1alpha1
Kind: SeldonDeployment
Metadata:
  Cluster Name:
  Creation Timestamp: 2018-02-26T11:08:44Z
  Generation: 0
  Initializers:
  Resource Version: 1727964
  Self Link: /apis/machinelearning.seldon.io/v1alpha1/namespaces/default/seldondeployments/seldon-deployment-example
  UID: 68c7a8e7-1ae5-11e8-8128-080027af5713
Spec:
  Annotations:
    Deployment _ Version: 0.1
    Project _ Name: Iris classification
  Name: sklearn-iris-deployment
  Oauth _ Key: oauth-key
  Oauth _ Secret: oauth-secret
  Predictors:
    Annotations:
      Predictor _ Version: 0.1
    Component Spec:
      Metadata:
        Labels:
          Seldon - App: sklearn-iris-deployment
      Spec:
        Containers:
          Env:
            Name: PREDICTIVE_UNIT_SERVICE_PORT
            Value: 9000
            Name: PREDICTIVE_UNIT_PARAMETERS
            Value: []
            Name: PREDICTIVE_UNIT_ID
            Value: sklearn-iris-classifier
            Name: PREDICTOR_ID
            Value: sklearn-iris-predictor
            Name: SELDON_DEPLOYMENT_ID
            Value: seldon-deployment-example
          Image: seldonio/irisclassifier:0.1
          Image Pull Policy: IfNotPresent
          Lifecycle:
            Pre Stop:
              Exec:
                Command:
                  /bin/sh
                  -c
                  /bin/sleep 5
          Liveness Probe:
            Handler:
              Tcp Socket:
                Port: http
            Initial Delay Seconds: 10
            Period Seconds: 5
          Name: sklearn-iris-classifier
          Ports:
            Container Port: 9000
            Name: http
          Readiness Probe:
            Handler:
              Tcp Socket:
                Port: http
            Initial Delay Seconds: 10
            Period Seconds: 5
          Resources:
            Requests:
              Memory: 1Mi
        Termination Grace Period Seconds: 20
    Graph:
      Endpoint:
        Service _ Host: 0.0.0.0
        Service _ Port: 9000
        Type: REST
      Name: sklearn-iris-classifier
      Type: MODEL
    Name: sklearn-iris-predictor
    Replicas: 1
Status:
  Predictor Status:
    Name: sklearn-iris-deployment-sklearn-iris-predictor
    Replicas: 2
    Replicas Available: 0
Events:

The one on the Kubernetes cluster:

Name: seldon-deployment-example
Namespace: default
Labels: app=seldon
Annotations: kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"machinelearning.seldon.io/v1alpha1","kind":"SeldonDeployment","metadata":{"annotations":{},"labels":{"app":"seldon"},"name":"seldon-depl...
API Version: machinelearning.seldon.io/v1alpha1
Kind: SeldonDeployment
Metadata:
  Cluster Name:
  Creation Timestamp: 2018-03-16T07:06:58Z
  Deletion Grace Period Seconds:
  Deletion Timestamp:
  Generation: 0
  Initializers:
  Resource Version: 7741
  Self Link: /apis/machinelearning.seldon.io/v1alpha1/namespaces/default/seldondeployments/seldon-deployment-example
  UID: 9dd818b1-28e8-11e8-a129-000c292ea0a2
Spec:
  Annotations:
    Deployment _ Version: 0.1
    Project _ Name: Iris classification
  Name: sklearn-iris-deployment
  Oauth _ Key: oauth-key
  Oauth _ Secret: oauth-secret
  Predictors:
    Annotations:
      Predictor _ Version: 0.1
    Component Spec:
      Spec:
        Containers:
          Image: eric1991/irisclassifier:3.0
          Image Pull Policy: IfNotPresent
          Name: sklearn-iris-classifier
          Resources:
            Requests:
              Memory: 1Mi
        Termination Grace Period Seconds: 20
    Graph:
      Children:
      Endpoint:
        Type: REST
      Name: sklearn-iris-classifier
      Type: MODEL
    Name: sklearn-iris-predictor
    Replicas: 1
Events:

I marked some of the differences (in italics).

So far, this is everything I have tried on this issue. Thank you for your patience and your answer!
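
The comparison in the last step can also be done mechanically. Below is a minimal Python sketch that lists the field names present in one `kubectl describe` output but not the other. The two excerpts are shortened from the outputs above, and the helper names are hypothetical; the approach is simply to treat the text before the first colon on each line as a field name.

```python
def field_names(describe_text):
    """Collect the label before the first ':' on every non-empty line."""
    names = set()
    for line in describe_text.splitlines():
        line = line.strip()
        if ":" in line:
            names.add(line.split(":", 1)[0].strip())
    return names


def only_in_first(first_text, second_text):
    """Return the sorted field names that appear only in first_text."""
    return sorted(field_names(first_text) - field_names(second_text))


# Shortened excerpts of the two outputs quoted above.
MINIKUBE_EXCERPT = """\
Containers:
Env:
Image: seldonio/irisclassifier:0.1
Liveness Probe:
Readiness Probe:
Graph:
Status:
"""

CLUSTER_EXCERPT = """\
Containers:
Image: eric1991/irisclassifier:3.0
Graph:
Children:
"""

print(only_in_first(MINIKUBE_EXCERPT, CLUSTER_EXCERPT))
print(only_in_first(CLUSTER_EXCERPT, MINIKUBE_EXCERPT))
```

Because it only looks at field names, this ignores differing values such as the image tag; a line-by-line diff would be needed to see those.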

@zhangwei730

@cliveseldon The long text I posted before may be a little messy, sorry!
Also, the Kubernetes example I used to verify the cluster creates a web page that submits a string.
I wonder why the pod cannot be created without any error, as if the kubectl apply command never ran.
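
One way to confirm that the model pod really is missing, rather than just slow to start, is to filter the pod listing by name. This is a small Python sketch that parses text in the column layout of `kubectl get pods --all-namespaces`; the sample rows are copied from the listing above, and the function name is hypothetical.

```python
def pods_matching(listing, fragment):
    """Return (namespace, pod name) pairs whose pod name contains fragment."""
    matches = []
    for line in listing.strip().splitlines()[1:]:  # skip the header row
        columns = line.split()
        if len(columns) >= 2 and fragment in columns[1]:
            matches.append((columns[0], columns[1]))
    return matches


# Rows copied from the listing quoted earlier in the thread.
LISTING = """\
NAMESPACE NAME READY STATUS RESTARTS AGE
default redis-6668b544f4-qxmhc 1/1 Running 0 14m
default seldon-apiserver-587f6fbc4f-ncnzg 1/1 Running 0 14m
default seldon-cluster-manager-67cc67995b-rlhmj 1/1 Running 0 14m
kube-system tiller-deploy-f44659b6c-g5g2v 1/1 Running 0 16m
"""

# No pod name contains "irisclassifier", so the model pod was never created.
print(pods_matching(LISTING, "irisclassifier"))
# The seldon-core control-plane pods are present.
print(pods_matching(LISTING, "seldon"))
```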

@ukclivecox
Contributor

Did you install seldon-core with RBAC enabled? For example:

helm install ../helm-charts/seldon-core --name seldon-core \
        --set cluster_manager.rbac=true \
        --set apife_service_type=LoadBalancer \
        --namespace seldon

Can you check the logs of the cluster-manager? There will be errors if RBAC is required on your cluster and not enabled.
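
When scanning logs for RBAC problems, it can help to pull out which service account was denied which action. The Python sketch below parses a denial in the form of the helm error quoted earlier in this thread. It is only an assumption that the cluster-manager logs use the same wording, so treat the pattern as a starting point rather than a reliable detector.

```python
import re

# Pattern modeled on the denial message quoted earlier in this thread.
DENIAL = re.compile(
    r'"system:serviceaccount:(?P<namespace>[^:"]+):(?P<account>[^"]+)"'
    r" cannot (?P<verb>\w+) (?P<resource>[\w.]+)"
)


def parse_rbac_denial(line):
    """Return the denied account, verb and resource, or None if no match."""
    match = DENIAL.search(line)
    return match.groupdict() if match else None


denied = parse_rbac_denial(
    '"system:serviceaccount:kube-system:default" cannot get namespaces '
    'in the namespace "default"'
)
other = parse_rbac_denial("Error syncing deployment default/redis")

print(denied)
print(other)
```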

@zhangwei730

@cliveseldon I checked the kube-cluster-manager log and found that there are indeed several errors, but I am not sure whether they are RBAC-related, so I am posting them here. And indeed, I did not install seldon-core with RBAC enabled.

----- actual_state_of_world.go:483] Failed to set statusUpdateNeeded to needed true because nodeName="node1" does not exist
----- Failed to update statusUpdateNeeded field in actual state of world: Failed to set statusUpdateNeedd ....
----- Failed to set statusUpdateNeeded to needed true because nodeName="node2" does not exist
----- Failed to update statusUpdateNeeded field in actual state of world: Failed to set statusUpdateNeeded
----- deployment_controller.go:483] Error syncing deployment kube-system/tiller-deploy: Operation cannot be fulfilled on deployments.
----- deployment_controller.go:483] Error syncing deployment kube-system/tiller-deploy: Operation cannot be fulfilled on deployments.
----- deployment_controller.go:483] Error syncing deployment default/seldon-apiserver: Operation cannot be fulfilled on deployments.
----- deployment_controller.go:483] Error syncing deployment default/seldon-cluster-manager: Operation cannot be fulfilled on deploym
----- Error syncing deployment default/redis: Operation cannot be fulfilled on deployments.extensions "

I noticed that there seem to be two kinds of errors: one about missing Kubernetes nodes, and one about seldon failing. What do these two errors mean, and how can I solve them?

Your answer will be a tremendous help to me, thank you very much!

@zhangwei730

@cliveseldon I missed one error:
---- leaderelection.go:224] error retrieving resource lock kube-system/kube-controller-manager: Get http://127.0.0.1:8080/api/v1/nam

@zhangwei730

@cliveseldon Following your guide, I successfully ran through the seldon model!

I am truly grateful for your help! Again, thank you so much!

@ukclivecox
Contributor

@zhangwei730 Glad it's working.
Feel free to ask questions on our slack channel as well.
Will close this now.

agrski pushed a commit that referenced this issue Dec 2, 2022
* adjust and fix tests for flattened versions

* tidy up state manager

* model_state.go cleanup

* re-instate getVersionsForAllModels

* fix model_state_test.go

* fix rproxy_grpc_test.go

* fixes after sorting build

* lint

* sperate lock in a different stuct

* fix tests

* remove LRU dep from state manager

* add a test for reload lock

* remove reload lock

* tidy up names

* add defer

* add logging

* add transaction for unload

* wait for item in case of of get and delete

* add a test to check state of models

* set log level to info for now

* fix failing test

* add state check to remaining tests

* wait for item lock outside of global mutex

* revert change as will be considered in another PR

* post merge fixes

* display diff in model states (test)

* clean up load model from versions

* tidy up failure in loadModel

* dont do remove model version in client.go

* move message to debug

* combine cache and tx implementation

* working without unloading vs evict

* state is inconsistent still (evict - unload)

* adding peek , state is ok

* fix lrucache test

* simplify apis

* simplify unload logic

* only hold write lock when  model is not in memory

* remove extra internal locks

* remove control plane lock!

* re introduce txmanager

* tidy up comments

* further tidy up

* fix lint

* minor fix log msgs

* fix lint

* add scheduler address and port

* tidy up post merge

* Review comments

4 participants