Error while replicating mnist_with_summaries #1159

Closed
panchul opened this issue May 13, 2020 · 3 comments

panchul commented May 13, 2020

Hello,

I am trying to run the examples/v1/mnist_with_summaries code on Azure.
The TFJob was created successfully, but its pod stays in the Pending state. I believe the problem is with storage: after applying the YAML files in tfevent-volume, the persistent volume is created but never becomes Bound, so the worker pod never starts.

$ k -n kubeflow get pv tfevent-volume
NAME             CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM   STORAGECLASS   REASON   AGE
tfevent-volume   10Gi       RWX            Retain           Available           standard                22m

The PVC for tfevent-volume also stays in the Pending state:

$ k -n kubeflow get pvc
NAME             STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
...
tfevent-volume   Pending                                                                        default        19m
...

The events on the PVC report an error with the access mode:

$ k -n kubeflow describe pvc tfevent-volume
Name:          tfevent-volume
Namespace:     kubeflow
StorageClass:  default
Status:        Pending
Volume:
Labels:        app=tfjob
               type=local
Annotations:   volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/azure-disk
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode:    Filesystem
Mounted By:    mnist-tf-worker-0
Events:
  Type     Reason              Age                  From                         Message
  ----     ------              ----                 ----                         -------
  Warning  ProvisioningFailed  116s (x13 over 14m)  persistentvolume-controller  Failed to provision volume with StorageClass "default": invalid AccessModes [ReadWriteMany]: only AccessModes [ReadWriteOnce] are supported

Here are tfevent-volume/tfevent-pv.yaml and tfevent-volume/tfevent-pvc.yaml (unchanged from the repository):

$ cat tfevent-volume/tfevent-pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: tfevent-volume
  labels:
    type: local
    app: tfjob
spec:
  capacity:
    storage: 10Gi
  storageClassName: standard
  accessModes:
    - ReadWriteMany
  hostPath:
    path: /tmp/data

$ cat tfevent-volume/tfevent-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: tfevent-volume
  namespace: kubeflow
  labels:
    type: local
    app: tfjob
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi

@issue-label-bot

Issue-Label Bot is automatically applying the labels:

Label Probability
kind/bug 0.85
area/tfjob 0.87


@gaocegege (Member)

You may need a different StorageClass, one that supports ReadWriteMany PVs.
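
Some context on why the claim never binds: the PVC above sets no storageClassName, so the cluster's default class (here "default", backed by kubernetes.io/azure-disk) is applied. The claim therefore cannot match the pre-created PV, which declares class "standard", and dynamic provisioning is attempted instead. Azure Disk only supports ReadWriteOnce, hence the ProvisioningFailed event. Below is a minimal sketch of two possible fixes, not a verified configuration. Option A assumes a single-node cluster, since a hostPath volume is local to one node. The class name azurefile in option B is an assumption; check `kubectl get storageclass` for what your cluster actually offers.

```yaml
# Apply only ONE of the two documents below; both define the same claim name.

# Option A (assumption: single-node cluster): bind to the pre-created
# hostPath PV by requesting the same class that tfevent-pv.yaml declares.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: tfevent-volume
  namespace: kubeflow
  labels:
    type: local
    app: tfjob
spec:
  storageClassName: standard   # matches storageClassName in tfevent-pv.yaml
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
---
# Option B: let the cluster provision a volume from a class that supports
# ReadWriteMany. "azurefile" is a hypothetical name used for illustration.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: tfevent-volume
  namespace: kubeflow
  labels:
    app: tfjob
spec:
  storageClassName: azurefile  # assumption: an RWX-capable class with this name exists
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
```

In either case the existing Pending claim would have to be deleted and recreated, because a PVC's storageClassName cannot be changed in place.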

stale bot commented Aug 12, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot closed this as completed Aug 19, 2020