Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TfJob should be marked as failed if setup fails #218

Closed
jlewi opened this issue Dec 12, 2017 · 1 comment
Closed

TfJob should be marked as failed if setup fails #218

jlewi opened this issue Dec 12, 2017 · 1 comment
Labels

Comments

@jlewi
Copy link
Contributor

jlewi commented Dec 12, 2017

Context #203

  • Controller pod has been running for 8 days
  • I have submitted multiple TfJobs that started successfully.
  • Currently there is a single TfJob (see tfjobs.yaml.txt)
    • The TfJob doesn't have runtimeId which indicates it hasn't been updated by the controller.
  • Log for the [controller.log.txt]controller.log.txt shows
    • Added event is recieved

      I1211 21:40:38.516560       1 controller.go:349] event: ADDED {
      ...
      E1211 21:40:38.530912       1 training.go:112] TfJob failed to setup: tbReplicaSpec.LogDir must be 
         specified
      
    I1211 21:40:38.706580 1 controller.go:349] event: MODIFIED {
    
    
  • So I think there is at least one bug with training.go:112 not updating the TfJob status to indicate that setup failed.
  • When I specified LogDir the job was successfully created.

tfjobs.yaml.txt

Ideally we should have an E2E test to verify the correct behavior.

@jlewi
Copy link
Contributor Author

jlewi commented Dec 12, 2017

So I can repo this using the following spec:

apiVersion: "tensorflow.org/v1alpha1"
kind: "TfJob"
metadata:
  name: "invalid-job-spec"
spec:
  tensorboard:
    logDir: 
  replicaSpecs:
    - replicas: 1
      tfPort: 2222
      tfReplicaType: MASTER
      template:
        spec:
          containers:
            - image: gcr.io/some/noneexistant-image:latest
              name: tensorflow
              command:
                - python
                - main.py
                - --log-dir
                - /tmp/tensorflow/
          restartPolicy: OnFailure

Which results in the following job

apiVersion: tensorflow.org/v1alpha1
kind: TfJob
metadata:
  clusterName: ""
  creationTimestamp: 2017-12-12T19:22:13Z
  generation: 0
  name: invalid-job-spec
  namespace: default
  resourceVersion: "3498435"
  selfLink: /apis/tensorflow.org/v1alpha1/namespaces/default/tfjobs/invalid-job-spec
  uid: c20fec99-df71-11e7-b6b9-42010af0014d
spec:
  RuntimeId: ""
  replicaSpecs:
  - IsDefaultPS: false
    replicas: 1
    template:
      metadata:
        creationTimestamp: null
      spec:
        containers:
        - command:
          - python
          - main.py
          - --log-dir
          - /tmp/tensorflow/
          image: gcr.io/some/noneexistant-image:latest
          name: tensorflow
          resources: {}
        restartPolicy: OnFailure
    tfPort: 2222
    tfReplicaType: MASTER
  tensorboard:
    logDir: ""
    serviceType: ""
    volumeMounts: null
    volumes: null
  tfImage: tensorflow/tensorflow:1.3.0
status:
  conditions: null
  controlPaused: false
  phase: Failed
  reason: tbReplicaSpec.LogDir must be specified
  replicaStatuses: null
  state: ""

The state isn't set but it should be set to failed.

jlewi added a commit that referenced this issue Dec 15, 2017
* Fixes #218
* If a job can't be initialized (e.g. because the spec is invalid) the state should be set to failed.
* I refactored the code so that in the event that the call to setup fails, it will end up getting retried as part of the next attempt at reconcilation
* I moved the call to setup into run
* We also refactored the code for reconciling the state into a separate function reconcile. I did this because we want to call reconcile when run is first invoked and not wait for the reconcile interval to elapse.
* Delete the method triggerCreatePhase; I moved the relevant code for setting the phase into setup.
* Delete some code that is a legacy of the etcd-operator and no longer being used.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant