TfJob should be marked as failed if setup fails #218

jlewi · 2017-12-12T19:15:19Z

Context #203

Controller pod has been running for 8 days
I have submitted multiple TfJobs that started successfully.
Currently there is a single TfJob (see tfjobs.yaml.txt)
- The TfJob doesn't have runtimeId which indicates it hasn't been updated by the controller.

Log for the [controller.log.txt]controller.log.txt shows

Added event is recieved

I1211 21:40:38.516560       1 controller.go:349] event: ADDED {
...
E1211 21:40:38.530912       1 training.go:112] TfJob failed to setup: tbReplicaSpec.LogDir must be 
   specified

I1211 21:40:38.706580 1 controller.go:349] event: MODIFIED {

So I think there is at least one bug with training.go:112 not updating the TfJob status to indicate that setup failed.
When I specified LogDir the job was successfully created.

tfjobs.yaml.txt

Ideally we should have an E2E test to verify the correct behavior.

The text was updated successfully, but these errors were encountered:

jlewi · 2017-12-12T19:30:42Z

So I can repo this using the following spec:

apiVersion: "tensorflow.org/v1alpha1"
kind: "TfJob"
metadata:
  name: "invalid-job-spec"
spec:
  tensorboard:
    logDir: 
  replicaSpecs:
    - replicas: 1
      tfPort: 2222
      tfReplicaType: MASTER
      template:
        spec:
          containers:
            - image: gcr.io/some/noneexistant-image:latest
              name: tensorflow
              command:
                - python
                - main.py
                - --log-dir
                - /tmp/tensorflow/
          restartPolicy: OnFailure

Which results in the following job

apiVersion: tensorflow.org/v1alpha1
kind: TfJob
metadata:
  clusterName: ""
  creationTimestamp: 2017-12-12T19:22:13Z
  generation: 0
  name: invalid-job-spec
  namespace: default
  resourceVersion: "3498435"
  selfLink: /apis/tensorflow.org/v1alpha1/namespaces/default/tfjobs/invalid-job-spec
  uid: c20fec99-df71-11e7-b6b9-42010af0014d
spec:
  RuntimeId: ""
  replicaSpecs:
  - IsDefaultPS: false
    replicas: 1
    template:
      metadata:
        creationTimestamp: null
      spec:
        containers:
        - command:
          - python
          - main.py
          - --log-dir
          - /tmp/tensorflow/
          image: gcr.io/some/noneexistant-image:latest
          name: tensorflow
          resources: {}
        restartPolicy: OnFailure
    tfPort: 2222
    tfReplicaType: MASTER
  tensorboard:
    logDir: ""
    serviceType: ""
    volumeMounts: null
    volumes: null
  tfImage: tensorflow/tensorflow:1.3.0
status:
  conditions: null
  controlPaused: false
  phase: Failed
  reason: tbReplicaSpec.LogDir must be specified
  replicaStatuses: null
  state: ""

The state isn't set but it should be set to failed.

* Fixes #218 * If a job can't be initialized (e.g. because the spec is invalid) the state should be set to failed. * I refactored the code so that in the event that the call to setup fails, it will end up getting retried as part of the next attempt at reconcilation * I moved the call to setup into run * We also refactored the code for reconciling the state into a separate function reconcile. I did this because we want to call reconcile when run is first invoked and not wait for the reconcile interval to elapse. * Delete the method triggerCreatePhase; I moved the relevant code for setting the phase into setup. * Delete some code that is a legacy of the etcd-operator and no longer being used.

jlewi added the kind/bug label Dec 12, 2017

jlewi mentioned this issue Dec 12, 2017

Submitted tfjobs cease to start running under unknown conditions #203

Closed

jlewi mentioned this issue Dec 12, 2017

Set state to failed if there is a problem initializing job #219

Merged

jlewi closed this as completed in #219 Dec 15, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TfJob should be marked as failed if setup fails #218

TfJob should be marked as failed if setup fails #218

jlewi commented Dec 12, 2017

jlewi commented Dec 12, 2017

TfJob should be marked as failed if setup fails #218

TfJob should be marked as failed if setup fails #218

Comments

jlewi commented Dec 12, 2017

jlewi commented Dec 12, 2017