
Some questions about tf-serving on NFS #844

Closed
gxfun opened this issue May 22, 2018 · 5 comments

Comments


gxfun commented May 22, 2018

Hi,
We cannot access the Internet, and Ambassador isn't working. Will these affect the use of tf-serving?

We used kubeadm 1.9.1 to set up Kubernetes.

Kubernetes
  master:  iecas-30-6
  slaves:  iecas-30-7, iecas-30-8

NFS
  server:  iecas-30-7
  clients: iecas-30-6, iecas-30-8
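For reference, a minimal NFS configuration matching this layout might look like the sketch below; the export options and the 192.168.30.0/24 subnet (taken from the node IP shown later in the pod description) are assumptions, not the actual settings used here.

# /etc/exports on the NFS server iecas-30-7 (assumed options)
/var/nfs/general  192.168.30.0/24(rw,sync,no_subtree_check)

# apply the export and verify it is visible from a client
sudo exportfs -ra        # run on iecas-30-7
showmount -e iecas-30-7  # run on iecas-30-6 / iecas-30-8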

Here is the information for the inception-nfs service.

kubectl get deployment inception-nfs -n kubeflow

NAME            DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
inception-nfs   1         1         1            1           31m

kubectl get services -n kubeflow

NAME               TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)             AGE
ambassador         ClusterIP   10.99.217.131   <none>        80/TCP              32m
ambassador-admin   ClusterIP   10.103.24.16    <none>        8877/TCP            32m
inception-nfs      ClusterIP   10.105.9.96     <none>        9000/TCP,8000/TCP   32m
k8s-dashboard      ClusterIP   10.111.23.158   <none>        443/TCP             32m
tf-hub-0           ClusterIP   None            <none>        8000/TCP            32m
tf-hub-lb          ClusterIP   10.98.150.141   <none>        80/TCP              32m
tf-job-dashboard   ClusterIP   10.110.154.14   <none>        80/TCP              32m

We can see that the EXTERNAL-IP is <none>.
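A ClusterIP service never gets an EXTERNAL-IP, so <none> is expected here. To reach the serving endpoint from outside the cluster while Ambassador is down, one option is a port-forward to the pod (a sketch using the pod name shown below):

kubectl port-forward inception-nfs-657769bbd5-w4cv2 9000:9000 -n kubeflow
# gRPC requests can then be sent to localhost:9000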

kubectl logs inception-nfs-657769bbd5-w4cv2 -n kubeflow

2018-05-21 18:42:33.129402: E tensorflow_serving/sources/storage_path/file_system_storage_path_source.cc:370] FileSystemStoragePathSource encountered a file-system access error: Could not find base path /mnt/var/nfs/general/inception for servable inception-nfs    

The error says Could not find base path /mnt/var/nfs/general/inception, but the model does exist at /var/nfs/general/inception on the NFS server.

iecas@iecas-30-7: ll /var/nfs/general/

total 16
drwxr-xr-x 4 nobody nogroup 4096 May 22 01:17 ./
drwxr-xr-x 3 root   root    4096 Mar  4  2016 ../
-rw-r--r-- 1 nobody nogroup    0 Mar  4  2016 general.test
drwxr-xr-x 3 root   root    4096 May 22 01:17 inception/
drwxr-xr-x 2 root   root    4096 Mar  4  2016 pip/
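One way to confirm whether that path is visible inside the serving container is to list it from within the pod (a sketch, assuming the pod is running and the image provides ls):

kubectl exec inception-nfs-657769bbd5-w4cv2 -n kubeflow -- ls /mnt/var/nfs/general

Note that the Mounts section in the describe output below only shows the service-account token, i.e. no NFS volume is mounted at /mnt, which would explain why the base path cannot be found inside the container.
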
kubectl describe pod  inception-nfs-657769bbd5-w4cv2 -n kubeflow

Name:           inception-nfs-657769bbd5-w4cv2
Namespace:      kubeflow
Node:           iecas-30-8/192.168.30.8
Start Time:     Tue, 22 May 2018 02:03:44 +0800
Labels:         app=inception-nfs
                pod-template-hash=2133256681
Annotations:    <none>
Status:         Running
IP:             10.244.1.14
Controlled By:  ReplicaSet/inception-nfs-657769bbd5
Containers:
  inception-nfs:
    Container ID:  docker://5acfa1a67310929575ab65e89ca482106d088c9cf3ecee4e64710b26d538c930
    Image:         gcr.io/kubeflow-images-staging/tf-model-server-cpu:v20180327-995786ec
    Image ID:      docker://sha256:aeb4fbd2c5a15d0714054153556e6e445a1bbb8fcbac7b289467bb328025d9db
    Port:          9000/TCP
    Args:
      /usr/bin/tensorflow_model_server
      --port=9000
      --model_name=inception-nfs
      --model_base_path=/mnt/var/nfs/general/inception
    State:          Running
      Started:      Tue, 22 May 2018 02:07:11 +0800
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     4
      memory:  4Gi
    Requests:
      cpu:        1
      memory:     1Gi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-kw2s8 (ro)
Conditions:
  Type           Status
  Initialized    True 
  Ready          True 
  PodScheduled   True 
Volumes:
  default-token-kw2s8:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-kw2s8
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                 Age                From                 Message
  ----     ------                 ----               ----                 -------
  Normal   SuccessfulMountVolume  44m                kubelet, iecas-30-8  MountVolume.SetUp succeeded for volume "default-token-kw2s8"
  Warning  Failed                 44m                kubelet, iecas-30-8  Failed to pull image "gcr.io/kubeflow-images-staging/tf-model-server-cpu:v20180327-995786ec": rpc error: code = Unknown desc = Error response from daemon: Get https://gcr.io/v2/: dial tcp: lookup gcr.io on [::1]:53: read udp [::1]:51709->[::1]:53: read: connection refused
  Warning  Failed                 43m                kubelet, iecas-30-8  Failed to pull image "gcr.io/kubeflow-images-staging/tf-model-server-cpu:v20180327-995786ec": rpc error: code = Unknown desc = Error response from daemon: Get https://gcr.io/v2/: dial tcp: lookup gcr.io on [::1]:53: read udp [::1]:53334->[::1]:53: read: connection refused
  Warning  Failed                 43m                kubelet, iecas-30-8  Failed to pull image "gcr.io/kubeflow-images-staging/tf-model-server-cpu:v20180327-995786ec": rpc error: code = Unknown desc = Error response from daemon: Get https://gcr.io/v2/: dial tcp: lookup gcr.io on [::1]:53: read udp [::1]:34904->[::1]:53: read: connection refused
  Warning  Failed                 42m (x4 over 44m)  kubelet, iecas-30-8  Error: ErrImagePull
  Normal   Pulling                42m (x4 over 44m)  kubelet, iecas-30-8  pulling image "gcr.io/kubeflow-images-staging/tf-model-server-cpu:v20180327-995786ec"
  Warning  Failed                 42m                kubelet, iecas-30-8  Failed to pull image "gcr.io/kubeflow-images-staging/tf-model-server-cpu:v20180327-995786ec": rpc error: code = Unknown desc = Error response from daemon: Get https://gcr.io/v2/: dial tcp: lookup gcr.io on [::1]:53: read udp [::1]:60235->[::1]:53: read: connection refused
  Normal   BackOff                42m (x6 over 44m)  kubelet, iecas-30-8  Back-off pulling image "gcr.io/kubeflow-images-staging/tf-model-server-cpu:v20180327-995786ec"
  Warning  Failed                 42m (x6 over 44m)  kubelet, iecas-30-8  Error: ImagePullBackOff
  Normal   Scheduled              41m                default-scheduler    Successfully assigned inception-nfs-657769bbd5-w4cv2 to iecas-30-8
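The "lookup gcr.io on [::1]:53 ... connection refused" lines mean the Docker daemon on the node cannot resolve gcr.io at all (it is falling back to a localhost resolver). This can be confirmed on the node itself (a sketch, assuming shell access to iecas-30-8):

# on iecas-30-8
cat /etc/resolv.conf   # which DNS servers is the node using?
nslookup gcr.io        # fails if the node has no working resolver
docker pull gcr.io/kubeflow-images-staging/tf-model-server-cpu:v20180327-995786ec
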
iecas@iecas-30-6: kubectl edit service tf-job-dashboard -n kubeflow

# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: v1
kind: Service
metadata:
  annotations:
    getambassador.io/config: |-
      ---
      apiVersion: ambassador/v0
      kind:  Mapping
      name: tfjobs-ui-mapping
      prefix: /tfjobs/
      rewrite: /tfjobs/
      service: tf-job-dashboard.kubeflow
  creationTimestamp: 2018-05-21T18:05:41Z
  name: tf-job-dashboard
  namespace: kubeflow
  resourceVersion: "1750"
  selfLink: /api/v1/namespaces/kubeflow/services/tf-job-dashboard
  uid: 9320f5b3-5d21-11e8-9f7b-a0423f2e7641
spec:
  clusterIP: 10.110.154.14
  ports:
  - port: 80
    protocol: TCP
    targetPort: 8080
  selector:
    name: tf-job-dashboard
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
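If the service was edited in order to reach the dashboard without a LoadBalancer or Ambassador, one option is to switch it to a NodePort service (a sketch, not part of the original configuration; 30080 is an arbitrary value in the allowed 30000-32767 range):

spec:
  type: NodePort
  ports:
  - port: 80
    targetPort: 8080
    nodePort: 30080   # example value, assumption

The dashboard would then be reachable at http://<node-ip>:30080.
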
iecas@iecas-30-6:~/Documents/kubeflow/code/my-kubeflow$ kubectl get nodes

NAME         STATUS    ROLES     AGE       VERSION
iecas-30-6   Ready     master    54m       v1.9.1
iecas-30-7   Ready     <none>    49m       v1.9.1
iecas-30-8   Ready     <none>    51m       v1.9.1

iecas@iecas-30-6:~/Documents/kubeflow/code/my-kubeflow$ kubectl get pods --all-namespaces

NAMESPACE     NAME                                   READY     STATUS             RESTARTS   AGE
kube-system   etcd-iecas-30-6                        1/1       Running            0          53m
kube-system   kube-apiserver-iecas-30-6              1/1       Running            0          53m
kube-system   kube-controller-manager-iecas-30-6     1/1       Running            0          53m
kube-system   kube-dns-6f4fd4bdf-lbg2w               3/3       Running            0          54m
kube-system   kube-flannel-ds-8nzkh                  1/1       Running            0          52m
kube-system   kube-flannel-ds-f4q5h                  1/1       Running            0          51m
kube-system   kube-flannel-ds-hg449                  1/1       Running            0          50m
kube-system   kube-proxy-dfgtr                       1/1       Running            0          51m
kube-system   kube-proxy-nfqtb                       1/1       Running            0          50m
kube-system   kube-proxy-xdx2t                       1/1       Running            0          54m
kube-system   kube-scheduler-iecas-30-6              1/1       Running            0          53m
kube-system   nvidia-device-plugin-daemonset-h87m9   1/1       Running            0          50m
kube-system   nvidia-device-plugin-daemonset-mpvzg   1/1       Running            0          50m
kubeflow      ambassador-64dcb6694f-qnvvk            1/2       CrashLoopBackOff   11         38m
kubeflow      ambassador-6dffffbc5c-9vb59            1/2       CrashLoopBackOff   11         37m
kubeflow      ambassador-6dffffbc5c-qh2qj            1/2       CrashLoopBackOff   5          6m
kubeflow      ambassador-6dffffbc5c-w2gk9            1/2       CrashLoopBackOff   11         37m
kubeflow      inception-nfs-657769bbd5-w4cv2         1/1       Running            0          37m
kubeflow      spartakus-volunteer-66564f9679-s4gjn   1/1       Running            0          37m
kubeflow      tf-hub-0                               1/1       Running            0          37m
kubeflow      tf-job-dashboard-7d48f6456c-hd6n8      1/1       Running            0          38m
kubeflow      tf-job-operator-68cd79c8b5-rpxlp       1/1       Running            0          38m
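
The ambassador pods are all in CrashLoopBackOff. Their logs should show why (a sketch; the container name ambassador is assumed from the default Kubeflow deployment, which runs two containers per pod):

kubectl logs ambassador-64dcb6694f-qnvvk -c ambassador -n kubeflow --previous
kubectl describe pod ambassador-64dcb6694f-qnvvk -n kubeflow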

Thanks!


gxfun commented May 22, 2018

The reason for this error (Could not find base path /mnt/var/nfs/general/inception even though the model exists at /var/nfs/general/inception) is that the version of Kubeflow we were using is v0.1.2. We have now updated it to master.

There is another error:

 kubectl describe pod  inception-nfs-v1-7d4bfc4d59-czbmn -n kubeflow

Name:           inception-nfs-v1-7d4bfc4d59-czbmn
Namespace:      kubeflow
Node:           iecas-30-8/192.168.30.8
Start Time:     Wed, 23 May 2018 01:26:10 +0800
Labels:         app=inception-nfs
                pod-template-hash=3806970815
                version=v1
Annotations:    <none>
Status:         Pending
IP:             
Controlled By:  ReplicaSet/inception-nfs-v1-7d4bfc4d59
Containers:
  inception-nfs:
    Container ID:  
    Image:         gcr.io/kubeflow-images-public/tf-model-server-cpu:v20180327-995786ec
    Image ID:      
    Port:          9000/TCP
    Args:
      /usr/bin/tensorflow_model_server
      --port=9000
      --model_name=inception-nfs
      --model_base_path=/mnt/var/nfs/general/inception
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     4
      memory:  4Gi
    Requests:
      cpu:        1
      memory:     1Gi
    Environment:  <none>
    Mounts:
      /mnt from nfs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-kw2s8 (ro)
Conditions:
  Type           Status
  Initialized    True 
  Ready          False 
  PodScheduled   True 
Volumes:
  nfs:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  nfs
    ReadOnly:   false
  default-token-kw2s8:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-kw2s8
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                 Age   From                 Message
  ----     ------                 ----  ----                 -------
  Normal   SuccessfulMountVolume  6m    kubelet, iecas-30-8  MountVolume.SetUp succeeded for volume "default-token-kw2s8"
  Warning  FailedMount            4m    kubelet, iecas-30-8  Unable to mount volumes for pod "inception-nfs-v1-7d4bfc4d59-czbmn_kubeflow(95f9d6a5-5de5-11e8-9f7b-a0423f2e7641)": timeout expired waiting for volumes to attach/mount for pod "kubeflow"/"inception-nfs-v1-7d4bfc4d59-czbmn". list of unattached/unmounted volumes=[nfs]
  Normal   Scheduled              3m    default-scheduler    Successfully assigned inception-nfs-v1-7d4bfc4d59-czbmn to iecas-30-8
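
The FailedMount event means the kubelet on iecas-30-8 could not mount the PersistentVolumeClaim named nfs within the timeout. Two things worth checking (a sketch; the nfs-common package name assumes a Debian/Ubuntu node):

kubectl get pv
kubectl get pvc -n kubeflow   # the nfs claim should be Bound to a PV

# on iecas-30-8: NFS client utilities must be installed for the kubelet to mount NFS volumes
dpkg -l | grep nfs-common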


jlewi commented May 22, 2018

It looks like there is a problem mounting your NFS persistent volume. This doesn't look like an issue with Kubeflow but with your cluster/PV configuration.
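A minimal NFS PersistentVolume/PersistentVolumeClaim pair for this cluster could look like the sketch below; the names, capacity, and access mode are assumptions. Note that if the PV exports /var/nfs/general and the pod mounts the claim at /mnt, the model would appear at /mnt/inception, so the PV path and the configured model_base_path need to be kept consistent with each other.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteMany
  nfs:
    server: iecas-30-7
    path: /var/nfs/general
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs
  namespace: kubeflow
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: ""
  resources:
    requests:
      storage: 10Gi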


gxfun commented May 23, 2018

@jlewi
@lqj679ssn
Thanks.

I followed the steps in Serve a local model using Tensorflow Serving.md. Can you give me some advice on this issue?


jlewi commented May 23, 2018

It's hard for me to tell what's going on from all your replies because I can't tell which logs go with which commands.

In your first comment you pasted the following:

kubectl describe pod  inception-nfs-657769bbd5-w4cv2 -n kubeflow
...
Events:
  Type     Reason                 Age                From                 Message
  ----     ------                 ----               ----                 -------
  Normal   SuccessfulMountVolume  44m                kubelet, iecas-30-8  MountVolume.SetUp succeeded for volume "default-token-kw2s8"
  Warning  Failed                 44m                kubelet, iecas-30-8  Failed to pull image "gcr.io/kubeflow-images-staging/tf-model-server-cpu:v20180327-995786ec": rpc error: code = Unknown desc = Error response from daemon: Get https://gcr.io/v2/: dial tcp: lookup gcr.io on [::1]:53: read udp [::1]:51709->[::1]:53: read: connection refused
  Warning  Failed                 43m                kubelet, iecas-30-8  Failed to pull image "gcr.io/kubeflow-images-staging/tf-model-server-cpu:v20180327-995786ec": rpc error: code = Unknown desc = Error response from daemon: Get https://gcr.io/v2/: dial tcp: lookup gcr.io on [::1]:53: read udp [::1]:53334->[::1]:53: read: connection refused
  Warning  Failed                 43m                kubelet, iecas-30-8  Failed to pull image "gcr.io/kubeflow-images-staging/tf-model-server-cpu:v20180327-995786ec": rpc error: code = Unknown desc = Error response from daemon: Get https://gcr.io/v2/: dial tcp: lookup gcr.io on [::1]:53: read udp [::1]:34904->[::1]:53: read: connection refused
  Warning  Failed                 42m (x4 over 44m)  kubelet, iecas-30-8  Error: ErrImagePull
  Normal   Pulling                42m (x4 over 44m)  kubelet, iecas-30-8  pulling image "gcr.io/kubeflow-images-staging/tf-model-server-cpu:v20180327-995786ec"
  Warning  Failed                 42m                kubelet, iecas-30-8  Failed to pull image "gcr.io/kubeflow-images-staging/tf-model-server-cpu:v20180327-995786ec": rpc error: code = Unknown desc = Error response from daemon: Get https://gcr.io/v2/: dial tcp: lookup gcr.io on [::1]:53: read udp [::1]:60235->[::1]:53: read: connection refused
  Normal   BackOff                42m (x6 over 44m)  kubelet, iecas-30-8  Back-off pulling image "gcr.io/kubeflow-images-staging/tf-model-server-cpu:v20180327-995786ec"
  Warning  Failed                 42m (x6 over 44m)  kubelet, iecas-30-8  Error: ImagePullBackOff
  Normal   Scheduled              41m                default-scheduler    Successfully assigned inception-nfs-657769bbd5-w4cv2 to iecas-30-8

This indicates a problem pulling the docker image for TF Serving.

If you're having trouble accessing the internet, that could explain why you aren't able to start the pods, although it looks like other Kubeflow processes are running.

Verify you can access the image, e.g.

docker pull gcr.io/kubeflow-images-staging/tf-model-server-cpu:v20180327-995786ec

Then try to launch TF Serving and look at the output of kubectl describe pods <POD> to see what's going on with the pod.
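
If the cluster nodes cannot reach gcr.io at all, one common workaround is to pull the image on a machine that does have internet access and copy it to every node (a sketch; host names and file paths are assumptions):

# on a machine with internet access
docker pull gcr.io/kubeflow-images-staging/tf-model-server-cpu:v20180327-995786ec
docker save gcr.io/kubeflow-images-staging/tf-model-server-cpu:v20180327-995786ec -o tf-model-server-cpu.tar
scp tf-model-server-cpu.tar iecas@iecas-30-8:~/

# on each node
docker load -i ~/tf-model-server-cpu.tar

With the image preloaded on the nodes, the pod can start without pulling, since the default imagePullPolicy for a non-latest tag is IfNotPresent.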


jlewi commented Jun 1, 2018

/label question
Closing this issue since it's been dormant for 10 days. Please reopen if your issue is still unresolved.

jlewi closed this as completed Jun 1, 2018