can not set labels #580

Closed
Farrellow opened this issue May 9, 2018 · 17 comments

Comments

@Farrellow

Farrellow commented May 9, 2018

I set labels in the YAML file and ran kubectl apply -f on it, but the pods do not have the values I set in labels.
The YAML file looks like this:

kind: "TFJob"
metadata:
  name: "job9"
spec:
  replicaSpecs:
    - replicas: 1
      tfReplicaType: MASTER
      template:
        metadata:
          labels:
            gpu_type: GTX1080TI
        spec:
          affinity:
              nodeAffinity:
                  requiredDuringSchedulingIgnoredDuringExecution:
                      nodeSelectorTerms:
                          - matchExpressions:
                              - key: kubernetes.io/hostname
                                operator: In
                                values:
                                    - 110-255-0-78
          containers:
            - image: tensorflow/tensorflow:latest-gpu
              name: tensorflow
              command: ["python3", "/data/tf_smoke.py"]
              resources:
                  limits:
                      cpu: 1.0
                      memory: 1.0Gi
              volumeMounts:
                  - mountPath: /data
                    name: code
                  - mountPath: /dev/shm
                    name: shm
                  - mountPath: /usr/local/nvidia
                    name: nvidia-driver
          restartPolicy: OnFailure
          volumes:
              - hostPath:
                    path: /data
                name: code
              - hostPath:
                    path: /dev/shm
                name: shm
              - hostPath:
                    path: /var/lib/nvidia-docker/volumes/nvidia_driver/375.66
                name: nvidia-driver
          tolerations:
              - key: CriticalAddonsOnly
                operator: Exists
    - replicas: 1
      tfReplicaType: WORKER
      template:
        metadata:
          labels:
            gpu_type: GTX1080TI
        spec:
          affinity:
              nodeAffinity:
                  requiredDuringSchedulingIgnoredDuringExecution:
                      nodeSelectorTerms:
                          - matchExpressions:
                              - key: kubernetes.io/hostname
                                operator: In
                                values:
                                    - 110-255-0-78
          containers:
            - image: tensorflow/tensorflow:latest-gpu
              name: tensorflow
              command: ["python3", "/data/tf_smoke.py"]
              resources:
                  limits:
                      cpu: 1.0
                      memory: 1.0Gi
                      alpha.kubernetes.io/nvidia-gpu: 1
              volumeMounts:
                  - mountPath: /data
                    name: code
                  - mountPath: /dev/shm
                    name: shm
                  - mountPath: /usr/local/nvidia
                    name: nvidia-driver
          restartPolicy: OnFailure
          volumes:
              - hostPath:
                    path: /data
                name: code
              - hostPath:
                    path: /dev/shm
                name: shm
              - hostPath:
                    path: /var/lib/nvidia-docker/volumes/nvidia_driver/375.66
                name: nvidia-driver
          tolerations:
              - key: CriticalAddonsOnly
                operator: Exists
    - replicas: 1
      tfReplicaType: PS
      template:
        metadata:
          labels:
            gpu_type: GTX1080TI
        spec:
          affinity:
              nodeAffinity:
                  requiredDuringSchedulingIgnoredDuringExecution:
                      nodeSelectorTerms:
                          - matchExpressions:
                              - key: kubernetes.io/hostname
                                operator: In
                                values:
                                    - 110-255-0-78
          containers:
            - image: tensorflow/tensorflow:latest-gpu
              name: tensorflow
              command: ["python3", "/data/tf_smoke.py"]
              resources:
                  limits:
                      cpu: 1.0
                      memory: 1.0Gi
              volumeMounts:
                  - mountPath: /data
                    name: code
                  - mountPath: /dev/shm
                    name: shm
                  - mountPath: /usr/local/nvidia
                    name: nvidia-driver
          restartPolicy: OnFailure
          volumes:
              - hostPath:
                    path: /data
                name: code
              - hostPath:
                    path: /dev/shm
                name: shm
              - hostPath:
                    path: /var/lib/nvidia-docker/volumes/nvidia_driver/375.66
                name: nvidia-driver
          tolerations:
              - key: CriticalAddonsOnly
                operator: Exists

But the pods belonging to the TFJob named "job9" do not have the label gpu_type: GTX1080TI.
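
One way to confirm what the report above describes is to list the job's pods together with their labels. The following is a minimal sketch, not taken from the thread: it assumes kubectl is already pointed at the right cluster and namespace, and that pod names contain the TFJob name. The helper name show_job_pod_labels is hypothetical.

```shell
# Minimal sketch: list the pods of one TFJob together with their labels.
# ASSUMPTION: pod names contain the TFJob name (e.g. "job9"), and kubectl
# is configured for the cluster and namespace where the job runs.
show_job_pod_labels() {
  job_name="$1"
  # --show-labels adds a LABELS column to the normal pod listing.
  kubectl get pods --show-labels | grep -- "$job_name"
}

# Usage (needs a live cluster):
#   show_job_pod_labels job9
```

If gpu_type=GTX1080TI is missing from the LABELS column, the labels from the pod template were not propagated to the pods.
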
@gaocegege
Member

Please add apiVersion: "kubeflow.org/v1alpha1" before the config and try again.

Please have a look at https://github.com/kubeflow/tf-operator/blob/master/examples/tf_job.yaml

@Farrellow
Author

Sorry, my fault.
My YAML file actually looks like this:

apiVersion: "kubeflow.org/v1alpha1"
kind: "TFJob"
metadata:
  name: "job10"
spec:
  replicaSpecs:
    - replicas: 1
      tfReplicaType: MASTER
      template:
        metadata:
          labels:
            gpu_type: GTX1080TI
        spec:
          affinity:
              nodeAffinity:
                  requiredDuringSchedulingIgnoredDuringExecution:
                      nodeSelectorTerms:
                          - matchExpressions:
                              - key: kubernetes.io/hostname
                                operator: In
                                values:
                                    - 110-255-0-78
          containers:
            - image: tensorflow/tensorflow:latest-gpu
              name: tensorflow
              command: ["python3", "/data/tf_smoke.py"]
              resources:
                  limits:
                      cpu: 1.0
                      memory: 1.0Gi
              volumeMounts:
                  - mountPath: /data
                    name: code
                  - mountPath: /dev/shm
                    name: shm
                  - mountPath: /usr/local/nvidia
                    name: nvidia-driver
          restartPolicy: OnFailure
          volumes:
              - hostPath:
                    path: /data
                name: code
              - hostPath:
                    path: /dev/shm
                name: shm
              - hostPath:
                    path: /var/lib/nvidia-docker/volumes/nvidia_driver/375.66
                name: nvidia-driver
          tolerations:
              - key: CriticalAddonsOnly
                operator: Exists
    - replicas: 1
      tfReplicaType: WORKER
      template:
        metadata:
          labels:
            gpu_type: GTX1080TI
        spec:
          affinity:
              nodeAffinity:
                  requiredDuringSchedulingIgnoredDuringExecution:
                      nodeSelectorTerms:
                          - matchExpressions:
                              - key: kubernetes.io/hostname
                                operator: In
                                values:
                                    - 110-255-0-78
          containers:
            - image: tensorflow/tensorflow:latest-gpu
              name: tensorflow
              command: ["python3", "/data/tf_smoke.py"]
              resources:
                  limits:
                      cpu: 1.0
                      memory: 1.0Gi
                      alpha.kubernetes.io/nvidia-gpu: 1
              volumeMounts:
                  - mountPath: /data
                    name: code
                  - mountPath: /dev/shm
                    name: shm
                  - mountPath: /usr/local/nvidia
                    name: nvidia-driver
          restartPolicy: OnFailure
          volumes:
              - hostPath:
                    path: /data
                name: code
              - hostPath:
                    path: /dev/shm
                name: shm
              - hostPath:
                    path: /var/lib/nvidia-docker/volumes/nvidia_driver/375.66
                name: nvidia-driver
          tolerations:
              - key: CriticalAddonsOnly
                operator: Exists
    - replicas: 1
      tfReplicaType: PS
      template:
        metadata:
          labels:
            gpu_type: GTX1080TI
        spec:
          affinity:
              nodeAffinity:
                  requiredDuringSchedulingIgnoredDuringExecution:
                      nodeSelectorTerms:
                          - matchExpressions:
                              - key: kubernetes.io/hostname
                                operator: In
                                values:
                                    - 110-255-0-78
          containers:
            - image: tensorflow/tensorflow:latest-gpu
              name: tensorflow
              command: ["python3", "/data/tf_smoke.py"]
              resources:
                  limits:
                      cpu: 1.0
                      memory: 1.0Gi
              volumeMounts:
                  - mountPath: /data
                    name: code
                  - mountPath: /dev/shm
                    name: shm
                  - mountPath: /usr/local/nvidia
                    name: nvidia-driver
          restartPolicy: OnFailure
          volumes:
              - hostPath:
                    path: /data
                name: code
              - hostPath:
                    path: /dev/shm
                name: shm
              - hostPath:
                    path: /var/lib/nvidia-docker/volumes/nvidia_driver/375.66
                name: nvidia-driver
          tolerations:
              - key: CriticalAddonsOnly
                operator: Exists

I carelessly left out the apiVersion: "kubeflow.org/v1alpha1" line when writing the issue.

But the labels are still not set on the pods.

@gaocegege
Member

Ref #542

We should have implemented this feature already.

@u2takey Do you have any thoughts about the issue?

@gaocegege
Member

Do you mean that you set the labels in the pod template and the pods created by the operator do not have these labels?

@u2takey
Contributor

u2takey commented May 10, 2018

@Farrellow If you are using master/tf-operator (v1), the missing labels in the pod template were fixed by #542.

@Farrellow
Author

@gaocegege yes, it is.

@gaocegege
Member

Are you using the latest version? Or did you create the instance from kubeflow/kubeflow?

@Farrellow
Author

Farrellow commented May 10, 2018

@gaocegege I used ks to install Kubeflow, following this article:

ks init my-kubeflow
cd my-kubeflow/
VERSION=v0.1.2
ks registry add kubeflow github.com/kubeflow/kubeflow/tree/${VERSION}/kubeflow
ks pkg install kubeflow/core@${VERSION}
ks pkg install kubeflow/tf-serving@${VERSION}
ks pkg install kubeflow/tf-job@${VERSION}
ks generate core kubeflow-core --name=kubeflow-core
KF_ENV=nocloud
NAMESPACE=kubeflow
kubectl create namespace ${NAMESPACE}
ks env set ${KF_ENV} --namespace ${NAMESPACE}
ks apply ${KF_ENV} -c kubeflow-core

My version is v0.1.2.

@gaocegege
Member

OK, I found the cause. v0.1.2 uses commit a7511ff, which does not include the feature commit, so you cannot benefit from it.

You could try running the latest version of the operator on your own.

@Farrellow
Author

Farrellow commented May 10, 2018

@gaocegege OK, I will try to upgrade Kubeflow. Thank you very much.

@gaocegege
Member

Don't be so kind 😄

BTW, the newest version v0.1.3 also uses a7511ff. Thus I suggest building and running the operator on your own.

@Farrellow
Author

I installed the latest version of Kubeflow in my Kubernetes cluster, but it doesn't work. I still cannot set the labels.
I am not sure whether I upgraded Kubeflow successfully. How can I check the version of Kubeflow?

$ kubectl describe crd tfjobs.kubeflow.org
Name:         tfjobs.kubeflow.org
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  apiextensions.k8s.io/v1beta1
Kind:         CustomResourceDefinition
Metadata:
  Creation Timestamp:  2018-05-08T02:10:19Z
  Generation:          1
  Resource Version:    6751029
  Self Link:           /apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions/tfjobs.kubeflow.org
  UID:                 f4cc8e6a-5264-11e8-b130-0cc47a2a4e30
Spec:
  Group:  kubeflow.org
  Names:
    Kind:       TFJob
    List Kind:  TFJobList
    Plural:     tfjobs
    Singular:   tfjob
  Scope:        Namespaced
  Version:      v1alpha1
Status:
  Accepted Names:
    Kind:       TFJob
    List Kind:  TFJobList
    Plural:     tfjobs
    Singular:   tfjob
  Conditions:
    Last Transition Time:  2018-05-08T02:10:19Z
    Message:               no conflicts found
    Reason:                NoConflicts
    Status:                True
    Type:                  NamesAccepted
    Last Transition Time:  2018-05-08T02:10:19Z
    Message:               the initial names have been accepted
    Reason:                InitialNamesAccepted
    Status:                True
    Type:                  Established
Events:                    <none>
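
The CRD description above reports only the API version of the resource (v1alpha1), not which operator build is running. One possible way to see the deployed build is to print the container image of the operator deployment. This is a minimal sketch under assumptions not confirmed in the thread: the deployment name tf-job-operator and the namespace kubeflow are guesses based on the install commands, and the helper name operator_image is hypothetical.

```shell
# Minimal sketch: print the container image(s) of the operator deployment.
# ASSUMPTION: the deployment is named "tf-job-operator" and lives in the
# "kubeflow" namespace; adjust both arguments to match your install.
operator_image() {
  deployment="${1:-tf-job-operator}"
  namespace="${2:-kubeflow}"
  kubectl get deployment "$deployment" --namespace "$namespace" \
    -o jsonpath='{.spec.template.spec.containers[*].image}'
}

# Usage (needs a live cluster):
#   operator_image
#   operator_image my-operator my-namespace
```

The image tag may indicate which commit the operator was built from, which can then be compared with the commit mentioned in this thread.
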

@gaocegege
Member

Yeah, the latest version of kubeflow also uses a7511ff.

Thus I suggest building and running the operator on your own.

@Farrellow
Author

What do you mean by building and running the operator on my own? Do you mean that the latest version of Kubeflow on the GitHub master branch has not merged this commit a7511ff?

@gaocegege
Member

I mean the version of the operator used in Kubeflow does not include the enhancement.

@Farrellow
Author

How do I build and run the operator on my own? I don't know how to use ksonnet to install tf-operator myself.
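
Building the operator image is not covered in this thread. Once a self-built image exists, one possible way to deploy it through the same ksonnet app is to override the operator image parameter and re-apply the component. This is a sketch under assumptions: it assumes the kubeflow-core component exposes a tfJobImage parameter, the image name in the usage line is a hypothetical placeholder, and the helper name use_operator_image is also hypothetical.

```shell
# Minimal sketch: point an existing ksonnet app at a self-built operator image.
# ASSUMPTIONS: the kubeflow-core component exposes a "tfJobImage" parameter,
# and the image passed in is one you built and pushed yourself.
use_operator_image() {
  image="$1"
  ks_env="${2:-nocloud}"   # ksonnet environment, KF_ENV in the install steps
  # Override the operator image, then re-apply the component to the cluster.
  ks param set kubeflow-core tfJobImage "$image" &&
    ks apply "$ks_env" -c kubeflow-core
}

# Usage (run inside the ksonnet app directory; the image is a placeholder):
#   use_operator_image registry.example.com/tf_operator:dev
```
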

@gaocegege
Member
