Fix Head #293

Closed
4 tasks done
jlewi opened this issue Jan 11, 2018 · 6 comments

Comments

@jlewi
Contributor

jlewi commented Jan 11, 2018

Uber bug for fixing head.

A lot of bugs crept in during the refactor because of #280, which caused failed jobs to be reported as successful.

@jlewi
Contributor Author

jlewi commented Jan 12, 2018

Here's the latest postsubmit result

There are two failures.

gs/e2e_tests_dag_test.py failed.

There are also pylint issues.

@jlewi
Contributor Author

jlewi commented Jan 14, 2018

The most recent postsubmit had a single test failure: the GPU test timed out.

@jlewi
Contributor Author

jlewi commented Jan 14, 2018

It looks like the TfJob status might not be updated correctly. A job is stuck in the "creating" state even though the master has exited and the job should be marked as completed.

@jlewi
Contributor Author

jlewi commented Jan 14, 2018

With the refactor to use the informer and controller classes, does TrainingJob.Reconcile get called periodically? Or only in response to some event?

/cc @wackxu @ScorpioCPH

@ScorpioCPH
Member

@jlewi It is not called periodically; I think it is event-driven:

  • We need to set up an event handler for when TFJob resources change.
  • It will enqueue that TFJob resource for processing.
  • syncTFJob will dequeue and process the single work item (TFJob) and call TrainingJob.Reconcile.
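
The event-driven flow described in the bullets above can be sketched roughly as follows. This is a toy model in plain Python, not the operator's Go code: the `Controller` and `TrainingJob` classes, the `on_add` handler, and the `default/mnist` key are all hypothetical stand-ins chosen for illustration. The point it shows is that reconcile only runs when an event has put a key on the queue.

```python
from collections import deque


class TrainingJob:
    """Hypothetical stand-in for the operator's per-job object."""

    def __init__(self, name):
        self.name = name
        self.reconcile_calls = 0

    def reconcile(self):
        # A real Reconcile would compare desired and observed state;
        # here we only count how often it is invoked.
        self.reconcile_calls += 1


class Controller:
    """Toy controller: event handler -> work queue -> sync -> reconcile."""

    def __init__(self):
        self.queue = deque()  # keys of TFJobs waiting to be processed
        self.jobs = {}        # key -> TrainingJob

    def on_add(self, key):
        # Event handler (assumed name): runs only when an event is delivered.
        self.jobs.setdefault(key, TrainingJob(key))
        self.queue.append(key)

    def sync_tfjob(self):
        # Dequeue a single work item and reconcile it.
        if not self.queue:
            return False
        key = self.queue.popleft()
        self.jobs[key].reconcile()
        return True


controller = Controller()
controller.on_add("default/mnist")
while controller.sync_tfjob():
    pass
print(controller.jobs["default/mnist"].reconcile_calls)  # prints 1
```

Once the queue is drained, further calls to `sync_tfjob` do nothing, so in this model a job whose state changes without producing an event is never reconciled again.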

jlewi pushed a commit that referenced this issue Jan 15, 2018
* Add an Update function to the controller.

* The informer periodically generates an Update event but we aren't processing these because we don't have an update function.

* As a result, TrainingJob.Reconcile doesn't get called periodically and we aren't properly updating the status of the job.

ref #309
ref #293
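
A minimal sketch of the problem this commit message describes, assuming a simplified informer that replays every stored object as an update event on each resync. The `Informer` class, its method names, and `count_reconciles` are hypothetical and written in plain Python for illustration; they are not the client-go API or the operator's code. Each handler call stands in for "enqueue the job and reconcile it".

```python
class Informer:
    """Toy informer: one add event per object, then update events on resync."""

    def __init__(self):
        self.handlers = {}  # event type -> callback
        self.store = []     # keys of known TFJobs

    def add_event_handler(self, event_type, callback):
        self.handlers[event_type] = callback

    def _dispatch(self, event_type, key):
        handler = self.handlers.get(event_type)
        if handler is not None:  # events with no registered handler are dropped
            handler(key)

    def add(self, key):
        self.store.append(key)
        self._dispatch("add", key)

    def resync(self):
        # A periodic resync replays every stored object as an update event.
        for key in self.store:
            self._dispatch("update", key)


def count_reconciles(register_update, resyncs=3):
    """Count handler invocations for one job across several resyncs."""
    calls = []  # each entry stands for one enqueue-and-reconcile
    informer = Informer()
    informer.add_event_handler("add", calls.append)
    if register_update:
        informer.add_event_handler("update", calls.append)
    informer.add("default/mnist")
    for _ in range(resyncs):
        informer.resync()
    return len(calls)


print(count_reconciles(register_update=False))  # prints 1
print(count_reconciles(register_update=True))   # prints 4
```

Without an update handler the job is handled once, on creation, and every resync is ignored. With one registered, each resync triggers another pass, which is what lets a stale status get corrected.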
@jlewi
Contributor Author

jlewi commented Jan 16, 2018

Head is now fixed. Here's the latest passing postsubmit:
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/tensorflow_k8s/tf-k8s-postsubmit/108/

@jlewi jlewi closed this as completed Jan 16, 2018