[v1alpha2] Invalid Job Status #712

Closed
codeflitting opened this issue Jul 4, 2018 · 3 comments · Fixed by #715

Comments

@codeflitting
Member

codeflitting commented Jul 4, 2018

I created a TFJob, and the TFJobFailed condition appears before TFJobRunning in the job status.

Here is the TFJob spec:

apiVersion: "kubeflow.org/v1alpha2"
kind: "TFJob"
metadata:
  name: "dist-mnist-for-e2e-test"
spec:
  tfReplicaSpecs:
    PS:
      replicas: 2
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: tensorflow
              image: codeflitting/tf-dist-mnist-test:1.0
              command:
              - /bin/sh
              - -c
              - sleep
    Worker:
      replicas: 4
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow
              image: codeflitting/tf-dist-mnist-test:1.0
              command:
              - /bin/sh
              - -c
              - sleep

JobStatus:

  status:
    conditions:
    - lastTransitionTime: 2018-07-04T02:53:58Z
      lastUpdateTime: 2018-07-04T02:53:58Z
      message: TFJob dist-mnist-for-e2e-test is created.
      reason: TFJobCreated
      status: "True"
      type: Created
    - lastTransitionTime: 2018-07-04T02:53:58Z
      lastUpdateTime: 2018-07-04T02:54:03Z
      message: TFJob dist-mnist-for-e2e-test is failed.
      reason: TFJobFailed
      status: "True"
      type: Failed
    - lastTransitionTime: 2018-07-04T02:53:58Z
      lastUpdateTime: 2018-07-04T02:54:03Z
      message: TFJob dist-mnist-for-e2e-test is running.
      reason: TFJobRunning
      status: "True"
      type: Running
    tfReplicaStatuses:
      Chief: {}
      PS: {}
      Worker: {}

Discovered by @yph152
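
The status above reports both Failed and Running as "True" at the same time. A minimal sketch of how such a contradictory condition list could be detected follows. The `Condition` type and the `conflicting` helper are hypothetical, trimmed-down stand-ins written for illustration and are not part of the tf-operator API.

```go
package main

import "fmt"

// Condition is a hypothetical stand-in for a TFJob condition;
// only the two fields needed for the check are kept.
type Condition struct {
	Type   string
	Status string
}

// conflicting reports whether the list marks the job as both
// Failed and Running with status "True" at the same time.
func conflicting(conds []Condition) bool {
	failed, running := false, false
	for _, c := range conds {
		if c.Status != "True" {
			continue
		}
		switch c.Type {
		case "Failed":
			failed = true
		case "Running":
			running = true
		}
	}
	return failed && running
}

func main() {
	// Mirrors the order of conditions in the reported status.
	conds := []Condition{
		{Type: "Created", Status: "True"},
		{Type: "Failed", Status: "True"},
		{Type: "Running", Status: "True"},
	}
	fmt.Println(conflicting(conds))
}
```

A check along these lines could be used in an e2e test to flag the state reported here, assuming the test has access to the job's condition list.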

@codeflitting
Member Author

@gaocegege What do you think the final status should be?

@gaocegege
Member

I am not sure why the job failed. Did any worker fail?

@yph152
Contributor

yph152 commented Jul 4, 2018

I think adding the following can solve it.

func setCondition(status *tfv1alpha2.TFJobStatus, condition tfv1alpha2.TFJobCondition) {
	// Return early if the job is already marked as failed.
	if isFailed(*status) {
		return
	}
	currentCond := getCondition(*status, condition.Type)

	// Do nothing if the condition doesn't change.
	if currentCond != nil && currentCond.Status == condition.Status && currentCond.Reason == condition.Reason {
		return
	}
	// ... (rest of the function omitted)
}
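
The effect of the early return can be tried out in isolation. The following is only a sketch under stated assumptions: `ConditionType`, `Condition`, `JobStatus`, and the bodies of `isFailed` and `setCondition` are simplified, hypothetical stand-ins for the `tfv1alpha2` types and helpers, not the operator's actual code, and it says nothing about what the eventual fix looks like.

```go
package main

import "fmt"

// Simplified, hypothetical stand-ins for the tfv1alpha2 types.
type ConditionType string

const (
	Created ConditionType = "Created"
	Running ConditionType = "Running"
	Failed  ConditionType = "Failed"
)

type Condition struct {
	Type   ConditionType
	Status string
	Reason string
}

type JobStatus struct {
	Conditions []Condition
}

// isFailed reports whether a Failed condition with status "True" is present.
func isFailed(status JobStatus) bool {
	for _, c := range status.Conditions {
		if c.Type == Failed && c.Status == "True" {
			return true
		}
	}
	return false
}

// setCondition replaces or appends a condition, unless the job has
// already failed, in which case the update is dropped.
func setCondition(status *JobStatus, cond Condition) {
	if isFailed(*status) {
		return // Failed is treated as terminal in this sketch.
	}
	for i, c := range status.Conditions {
		if c.Type == cond.Type {
			status.Conditions[i] = cond
			return
		}
	}
	status.Conditions = append(status.Conditions, cond)
}

func main() {
	var s JobStatus
	setCondition(&s, Condition{Created, "True", "TFJobCreated"})
	setCondition(&s, Condition{Failed, "True", "TFJobFailed"})
	setCondition(&s, Condition{Running, "True", "TFJobRunning"}) // dropped by the guard
	for _, c := range s.Conditions {
		fmt.Println(c.Type, c.Reason)
	}
}
```

With the guard in place, the Running update that arrives after Failed is ignored, so the condition list ends at Failed. Note that this only prevents later conditions from being recorded; it does not address whether the job should have been marked as failed in the first place.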
