Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug with jobs not being marked as completed. #501

Merged
merged 8 commits into from
Mar 26, 2018

Conversation

jlewi
Copy link
Contributor

@jlewi jlewi commented Mar 23, 2018

Fix several bugs with the job controller.

  • A bug was introduced with getting the replica status in Create Pod instead of Job #344 which
    switched to creating pods directly.

  • Our presubmits/postsubmits were failing but this went unnoticed because
    the git status check was improperly reported as succeeded.

  • One bug is because we try to get the pod status by name but the name
    doesn't include the random salt in the pod name.

  • The code in question is a legacy of when we were using job controllers and
    we first got the status of the job controller. We incorrectly changed that
    code to get the pod. The correct thing is to just list pods by label; we
    already do that in the code below so we just need to delete some code.

  • A second bug is a problem with deleting resources.

    • Once a job is marked for deletion we shouldn't create any more resources. This would block
      deletion because we do foreground deletion so the TFJob won't be deleted until all child resources
      have been deleted.

    • When the job is deleted the DeletionTimestamp will be set on the object and we can use that
      inside the syncFunction

    • But reconcile needs to update the TFJob stored inside TrainingJob so that we pick up changes
      to TFJob made external to the operator; e.g. by the user issuing a delete request.

  • A third problem is resetting the rate limiter on the work queue

    • I think we want to reset the rateLimiter by calling forget after every successful processing
      of a work item. This way if we receive another event we can process it immediately.
  • Increase the timeout we wait for the job to finish.

  • Use logrus to log fields providing useful metadata like the job that a log entry goes with.

  • Fix E2E tests timing out; job appears to remain in running state even though job is done. #500


This change is Reviewable

@coveralls
Copy link

coveralls commented Mar 23, 2018

Coverage Status

Coverage decreased (-0.09%) to 45.323% when pulling 5f5d859 on jlewi:fix_pod into eec56b5 on kubeflow:master.

@jlewi
Copy link
Contributor Author

jlewi commented Mar 23, 2018

/retest

@jlewi
Copy link
Contributor Author

jlewi commented Mar 24, 2018

GPU test is passing but not simple TFJob.

@gaocegege
Copy link
Member

INFO\|2018-03-24T01:32:54\|py/tf_job_client.py\|96\| Job simple-tfjob in namespace default; uid=346f5fcd-2f03-11e8-9b8b-42010a8e008e; phase=Done, state=Succeeded,
INFO\|2018-03-24T01:32:54\|py/tf_job_client.py\|66\| Deleted job default.simple-tfjob
INFO\|2018-03-24T01:32:54\|py/tf_job_client.py\|63\| Deleting job default.simple-tfjob

The job has been deleted but we can get it via crd_api.get_namespaced_custom_object, that's why timesout.

@gaocegege
Copy link
Member

@jlewi I think we could merge the PR then I could try to file a new PR to fix simple tfjob test.

@gaocegege
Copy link
Member

I think it is caused by https://github.com/kubeflow/tf-operator/blob/0759f7ae5142ed2e78a6971e9703fdc86b7307cd/py/tf_job_client.py#L64:28

We do not check the response of the delete request.

@jlewi
Copy link
Contributor Author

jlewi commented Mar 24, 2018

Interested. Seems strange that it's only happening now and not with GPU jobs.

* A bug was introduced with getting the replica status in kubeflow#344 which
switched to creating pods directly.

* Our presubmits/postsubmits were failing but this went unnoticed because
the git status check was improperly reported as succeeded.

* The bug is because we try to get the pod status by name but the name
doesn't include the random salt in the pod name.

* The code in question is a legacy of when we were using job controllers and
we first got the status of the job controller. We incorrectly changed that
code to get the pod. The correct thing is to just list pods by label; we
already do that in the code below so we just need to delete some code.

* Don't create any resources if the DeletionTimestamp is set.
  Creating resources at this point would end blocking deletion of the object
  because the controller would create resources while we are trying to delete
  them.

* Use logrus in controller.go, trainer.go, and replicas.go to log
  with fields providing information about the job and repliac.
  This makes it easy to filter logs for a particular job.

* Use logrus to log the name of the job in a field.
@jlewi
Copy link
Contributor Author

jlewi commented Mar 25, 2018

/unassign @zjj2wry
/assign @gaocegege

Use the Phase to determine whether we should create resources.
@jlewi
Copy link
Contributor Author

jlewi commented Mar 25, 2018

@gaocegege Any idea what the lint failure means

The command "go build -o tf-operator github.com/kubeflow/tf-operator/cmd/tf-operator" exited with 0.
56.67s$ gometalinter --config=linter_config.json ./pkg/...
pkg/trainer/replicas.go:1::warning: file is not goimported (goimports)

@jlewi
Copy link
Contributor Author

jlewi commented Mar 25, 2018

/retest

@gaocegege
Copy link
Member

It seems that you already fixed the lint errors in travis.

@jlewi
Copy link
Contributor Author

jlewi commented Mar 25, 2018

For the most recent E2E failure.

The test submits multiple TFJobs with the same name.
The first job completes.
We then resubmit.
Pods are created and this job enters the running state.
TFOperator stops calling reconcile for the new job at 21:00:08 (shortly after the pods are created)

  • This could explain why the job is stuck in the running state.
  • The logs for the master pod indicate it ran to completion successfully.

The logs indicate that between 21:00:12 and 21:07 (when cluster is deleted because of test-runner timeout) reconcile is never called for the TFJob.
* My expectation though is that the informer should be calling reconcile every 30 seconds. Why isn't that happening?

@jlewi
Copy link
Contributor Author

jlewi commented Mar 25, 2018

My conjecture is that the problem is the ratelimiting work queue.

We use the defaults which uses exponential backoff with a max delay per item of 1000 seconds.

So every 30 seconds the informer is queuing an update event but these are then being rate limited and processed with a max delay of 1000 seconds.

The log also indicates that we call forget on the workqueue only once and thats for the gpu job.
{"filename":"controller/controller.go:195","job":"default/gpu-tfjob","msg":"WorkQueue forgetting key default/gpu-tfjob","level":"info"}

I think this is a bug because I think forget causes the ratelimiter to reset so we should be calling it after every successful call so that we don't invoke rate limits.

jlewi added 3 commits March 25, 2018 10:56
* Otherwise the ratelimiter will end up delaying processing subsequent
  events which isn't what we want.
…ures

  TrainingJob has an up to date representation of the job.

* Otherwise changes made to the spec won't be available to TrainingJob. For
  example, if the job is deleted by the user, the deletion timestamp will
  be set. But if we don't update the TFJob stored in TrainingJob this
  change won't be propogated.
@jlewi
Copy link
Contributor Author

jlewi commented Mar 25, 2018

Looks like the test passed. Lets make sure its not a fluke.

@jlewi
Copy link
Contributor Author

jlewi commented Mar 25, 2018

/test all

@jlewi
Copy link
Contributor Author

jlewi commented Mar 25, 2018

@gaocegege PTAL

@gaocegege
Copy link
Member

/lgtm

I think the test-and-forget logic is copied from job controller and I think it is not suitable for tfjob.

/cc @ScorpioCPH

@k8s-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gaocegege

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gaocegege
Copy link
Member

Thanks for the fix!

@k8s-ci-robot k8s-ci-robot merged commit b72f47e into kubeflow:master Mar 26, 2018
@jlewi jlewi mentioned this pull request Mar 27, 2018
jetmuffin pushed a commit to jetmuffin/tf-operator that referenced this pull request Jul 9, 2018
* Fix bug with jobs not being marked as completed.

* A bug was introduced with getting the replica status in kubeflow#344 which
switched to creating pods directly.

* Our presubmits/postsubmits were failing but this went unnoticed because
the git status check was improperly reported as succeeded.

* The bug is because we try to get the pod status by name but the name
doesn't include the random salt in the pod name.

* The code in question is a legacy of when we were using job controllers and
we first got the status of the job controller. We incorrectly changed that
code to get the pod. The correct thing is to just list pods by label; we
already do that in the code below so we just need to delete some code.

* Don't create any resources if the DeletionTimestamp is set.
  Creating resources at this point would end blocking deletion of the object
  because the controller would create resources while we are trying to delete
  them.

* Use logrus in controller.go, trainer.go, and replicas.go to log
  with fields providing information about the job and repliac.
  This makes it easy to filter logs for a particular job.

* Use logrus to log the name of the job in a field.

* Checking the deletiontime stamp doesn't appear to be sufficient.

Use the Phase to determine whether we should create resources.

* Run gofmt.

* * Reset the rate limiter after every successful sync.
* Otherwise the ratelimiter will end up delaying processing subsequent
  events which isn't what we want.

* Run goimports to fix lint issues.

* * Reconcile needs to update the TFJob stored in TrainingJob. This ensures
  TrainingJob has an up to date representation of the job.

* Otherwise changes made to the spec won't be available to TrainingJob. For
  example, if the job is deleted by the user, the deletion timestamp will
  be set. But if we don't update the TFJob stored in TrainingJob this
  change won't be propogated.

* * TrainingJob.update should log the value of the job not the pointer.

* Add more comments to the code.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants