fail to run example job. invalid job spec: tfReplicaSpec.TfPort can''t be nil #284

Closed
GoodJoey opened this issue Jan 11, 2018 · 11 comments · Fixed by #299

@GoodJoey

I tried kubectl create -f https://raw.githubusercontent.com/tensorflow/k8s/master/examples/tf_job.yaml
and then ran kubectl get TfJob -o yaml, which gave this output:

- apiVersion: tensorflow.org/v1alpha1
  kind: TfJob
  metadata:
    clusterName: ""
    creationTimestamp: 2018-01-11T08:34:46Z
    generation: 0
    name: example-job
    namespace: default
    resourceVersion: "30201"
    selfLink: /apis/tensorflow.org/v1alpha1/namespaces/default/tfjobs/example-job
    uid: 47e9ba9d-f6aa-11e7-baad-4ccc6ab8f7ad
  spec:
    RuntimeId: ""
    replicaSpecs:
    - IsDefaultPS: false
      replicas: 1
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - image: gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff
            name: tensorflow
            resources: {}
          restartPolicy: OnFailure
      tfReplicaType: MASTER
    - IsDefaultPS: false
      replicas: 1
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - image: gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff
            name: tensorflow
            resources: {}
          restartPolicy: OnFailure
      tfReplicaType: WORKER
    - IsDefaultPS: false
      replicas: 2
      tfReplicaType: PS
    tensorboard: null
  status:
    phase: Failed
    reason: 'invalid job spec: tfReplicaSpec.TfPort can''t be nil.'
    replicaStatuses: null
    state: Failed

Does anyone know what's happening here? Thanks!

@gaocegege
Member

It is a bug. As a workaround, you could add TfPort to the YAML:

apiVersion: "tensorflow.org/v1alpha1"
kind: "TfJob"
metadata:
  name: "example-job"
spec:
  replicaSpecs:
    - replicas: 1
      tfReplicaType: MASTER
      TfPort: 2222
      template:
        spec:
          containers:
            - image: gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff
              name: tensorflow
          restartPolicy: OnFailure
    - replicas: 1
      tfReplicaType: WORKER
      TfPort: 2222
      template:
        spec:
          containers:
            - image: gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff
              name: tensorflow
          restartPolicy: OnFailure
    - replicas: 2
      tfReplicaType: PS
      TfPort: 2222
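
For a job spec with many replica specs, the same workaround could be applied programmatically. The following is only a minimal sketch: it assumes the manifest has already been parsed into a Python dict (for example with a YAML parser), and `add_tf_port` and `DEFAULT_TF_PORT` are hypothetical names that are not part of the operator or of kubectl. The key `TfPort` and the value 2222 are taken from the YAML above.

```python
import copy

DEFAULT_TF_PORT = 2222  # port used in the workaround YAML above


def add_tf_port(manifest, port=DEFAULT_TF_PORT):
    """Return a copy of a TfJob manifest with TfPort set on every replica spec.

    Hypothetical helper: `manifest` is the job spec already parsed into a dict.
    Replica specs that already carry a TfPort are left unchanged.
    """
    patched = copy.deepcopy(manifest)
    for replica in patched.get("spec", {}).get("replicaSpecs", []):
        if replica.get("TfPort") is None:
            replica["TfPort"] = port
    return patched


# Trimmed-down version of the example job (pod templates omitted for brevity).
example_job = {
    "apiVersion": "tensorflow.org/v1alpha1",
    "kind": "TfJob",
    "metadata": {"name": "example-job"},
    "spec": {
        "replicaSpecs": [
            {"replicas": 1, "tfReplicaType": "MASTER"},
            {"replicas": 1, "tfReplicaType": "WORKER"},
            {"replicas": 2, "tfReplicaType": "PS"},
        ]
    },
}

patched_job = add_tf_port(example_job)
```

The helper returns a patched copy rather than mutating its input, so the original manifest stays available for comparison.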

@GoodJoey
Author

GoodJoey commented Jan 11, 2018

Thanks for the response, @gaocegege.
The error seems to be gone, but I can't see the TfJob in my dashboard,
and no related containers are running on any of my nodes.

@gaocegege
Member

@GoodJoey Sorry about that, it seems that our code is broken right now. I am not sure when all of the bugs will be fixed. Maybe you could try it at the commit https://github.com/tensorflow/k8s/tree/430cf179ba9c1ce4a134d3800f871dbbb0c73da1

@GoodJoey
Author

@gaocegege
I think the chart should be rolled back. Or do you mean I should use the branch to build a new one?
CHART=https://storage.googleapis.com/tf-on-k8s-dogfood-releases/latest/tf-job-operator-chart-latest.tgz

@gaocegege
Member

Emmmm, I didn't try the chart. I mean that the code in master has some bugs: #289 #287 #286

Maybe you could check out the commit 430cf17 and try again. I think it should work well.

@GoodJoey
Author

You mean I should use commit 430cf17 to build a new tf_operator Docker image, right?

@gaocegege
Member

Yeah, I am not sure if it works but I think it does. I am asking @jlewi to release a stable version for users here: #280 (comment)

@GoodJoey
Author

It seems I need to set up a Go environment? Are there any quick scripts to do that?
Or is there an existing image? Sorry, I haven't built the tf_operator image on my own before.

@gaocegege
Member

gaocegege commented Jan 11, 2018

Sorry, I also know little about the release process of the operator. If it is not urgent, maybe you could wait for these PRs; once they are merged, I think the problems will be solved.

@jlewi
Contributor

jlewi commented Jan 11, 2018

Sorry for the slow reply. The latest GCS link now points to an old stable release.

jlewi mentioned this issue Jan 11, 2018
@jlewi
Contributor

jlewi commented Jan 12, 2018

We should be setting a default port here, which is called from some of the informer-generated code.

Looks like that's not happening. Any ideas why?

/cc @gaocegege @wackxu

jlewi added a commit that referenced this issue Jan 12, 2018
* Need to register the default functions.
* In setup we need to invoke setting the defaults on the objects.
* This fixes a break introduced when we refactored the code.
* Fix #284
* Fix #297