Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set state to failed if there is a problem initializing job #219

Merged
merged 5 commits into from
Dec 15, 2017

Conversation

jlewi
Copy link
Contributor

@jlewi jlewi commented Dec 12, 2017

  • Fixes TfJob should be marked as failed if setup fails #218
  • If a job can't be initialized (e.g. because the spec is invalid) the state should be set to failed.
  • I refactored the code so that in the event that the call to setup fails, it will end up getting retried as part of the next attempt at reconcilation
    • I moved the call to setup into run
    • We also refactored the code for reconciling the state into a separate function reconcile. I did this because we want to call reconcile when run is first invoked and not wait for the reconcile interval to elapse.
  • Delete the method triggerCreatePhase; I moved the relevant code for setting the phase into setup.
  • Delete some code that is a legacy of the etcd-operator and no longer being used.

This change is Reviewable

* Clean up the code; there's no longer a reason to have a separate function
  triggerCreatePhase.

* Refactor setup to ensure we always set phase and state.
* We should always call updateTPRStatus after setup is called
  * Before the update was called in two different places (e.g. triggerCreatPhase) depending on the state.
* Call reconcile before the loop and inside it.
@jlewi
Copy link
Contributor Author

jlewi commented Dec 12, 2017

@ wackxu could you please review this?

@jlewi
Copy link
Contributor Author

jlewi commented Dec 12, 2017

@wackxu could you please review this?

err := j.job.Spec.SetDefaults()
if err != nil {
return fmt.Errorf("there was a problem setting defaults for job spec: %v", err)
j.status.SetReason("Internal error; tried to setup a job with no spec.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe tried to --> failed to would be better

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

// Make sure the runtime id is set.
if job.job.Spec.RuntimeId == "" {
if job.status.Phase != c.expectPhase {
t.Errorf("job.job.Status.Phase Want: %v Got:%v ", c.expectPhase, job.status.Phase)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool, i like this test style

if err := j.updateTPRStatus(); err != nil {
log.Warningf("Job %v; failed to update TPR status error: %v", j.job.Metadata.Name, err)
}
j.reconcile(config)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems we will do this every 8 minute intervals, if i understand this code, maybe we only need do setup when my job is created and phase is none.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reconcile function is the code that periodically checks the job and figures out what needs to be done. The reconcile function takes advantage of state (e.g. phase) preserved in the CRD to avoid repeating unnecessary steps like setup.

Copy link
Contributor Author

@jlewi jlewi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. PTAL.

@zjj2wry
Copy link
Member

zjj2wry commented Dec 14, 2017

this big enhance code, LGTM

@jlewi jlewi merged commit eb0fd5f into kubeflow:master Dec 15, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants