connect timed out with cluster-manager and api server #523

Closed
kaysonx opened this issue Apr 24, 2019 · 4 comments

kaysonx commented Apr 24, 2019

I followed the instructions to install Seldon Core with Helm, but got the following error:

Unexpected error trying to create CRD with:
io.kubernetes.client.ApiException: java.net.SocketTimeoutException: connect timed out

Failed to instantiate [io.seldon.clustermanager.k8s.SeldonDeploymentWatcher]: Constructor threw exception; nested exception is io.kubernetes.client.ApiException: java.net.SocketTimeoutException: connect timed out

I've also checked the Role/RoleBinding/ServiceAccount; they are as expected from the Helm charts.

Any suggestions, folks?

BTW, the commands I used are:

helm install ./seldon-core-crd-0.2.6.tgz  --name seldon-core-crd  --set usage_metrics.enabled=true

helm install ./seldon-core-0.2.6.tgz  --name seldon-core  --namespace seldon \
--set apife.image.name=my-private-registry/apife:0.2.6 \
--set cluster_manager.image.name=my-private-registry/cluster-manager:0.2.6 \
--set engine.image.name=my-private-registry/engine:0.2.6 \
--set redis.image.name=my-private-registry/redis:4.0.1
@ukclivecox
Contributor

This seems to suggest the cluster-manager pod can't connect to the k8s API. Is there anything special about the cluster you are running on? Does it allow k8s API access from your namespace or does the RBAC of your cluster disallow this?
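One quick way to narrow this down, as a rough sketch (assuming curl is available in the cluster-manager image or via a shell in the pod; the pod name below is a placeholder), is to call the API server directly with the mounted service-account token:

kubectl -n seldon exec -it <cluster-manager-pod> -- sh

# Inside the pod: the token and CA cert are mounted here by default.
TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
curl --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
     -H "Authorization: Bearer $TOKEN" \
     https://kubernetes.default.svc/version

A timeout here would point to a network/egress problem; a 401/403 would point to RBAC instead.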

@kaysonx
Author

kaysonx commented Apr 24, 2019

Thanks for your reply!

The cluster is OK; we already have a custom scheduler, written with the client-go SDK, running on it.
I just checked the pod, and the token is mounted correctly:
/var/run/secrets/kubernetes.io/serviceaccount from seldon-token-mwcpb

and the corresponding role is:

apiVersion: v1
items:
- apiVersion: rbac.authorization.k8s.io/v1
  kind: Role
  metadata:
    creationTimestamp: 2019-04-24T09:34:00Z
    name: seldon-local
    namespace: seldon
    resourceVersion: "40929244"
    selfLink: /apis/rbac.authorization.k8s.io/v1/namespaces/seldon/roles/seldon-local
    uid: 176d39f6-6674-11e9-a41d-70106fba9de6
  rules:
  - apiGroups:
    - '*'
    resources:
    - deployments
    - services
    verbs:
    - '*'
  - apiGroups:
    - machinelearning.seldon.io
    resources:
    - '*'
    verbs:
    - '*'
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

I guess the k8s Java SDK uses this mounted token at /var/run/secrets/kubernetes.io/serviceaccount.
Is that right? Or does the role need more permissions?
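For the permissions side, a rough sketch of how to check what the pod's service account is actually allowed to do (assuming the service account is named seldon, as the seldon-token-mwcpb mount suggests):

# Namespaced permissions granted by the Role above.
kubectl auth can-i list seldondeployments.machinelearning.seldon.io \
    --as=system:serviceaccount:seldon:seldon -n seldon

# CRD creation is cluster-scoped, so it would need a ClusterRole rather than this namespaced Role.
kubectl auth can-i create customresourcedefinitions \
    --as=system:serviceaccount:seldon:seldon

That said, a missing permission would normally come back as a 403 from the API server; a plain connect timed out looks more like a connectivity problem.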

@ukclivecox
Contributor

Yes, it will be using the seldon-local RBAC and the default token that k8s adds to the pod. Is your cluster set up with any restrictions?
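One kind of restriction that produces exactly this symptom (a connect timeout rather than a 403) is a default-deny egress NetworkPolicy, or a firewall between the namespace and the API server. A quick check, sketched with a hypothetical throwaway pod (any image with curl would do):

# Any default-deny policy here would block the cluster-manager's egress to the API server.
kubectl get networkpolicy -n seldon

# Reachability test from a throwaway pod in the seldon namespace.
# Any HTTP status back (even 401/403) proves connectivity; a hang means the route is blocked.
kubectl -n seldon run api-check --rm -it --restart=Never --image=curlimages/curl --command -- \
    curl -sk -o /dev/null -w '%{http_code}\n' https://kubernetes.default.svc/healthz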

@ukclivecox
Contributor

Please reopen if this is still an issue on the 0.4.0 release.

agrski pushed a commit that referenced this issue Dec 2, 2022