Fix bug with jobs not being marked as completed. #501
Conversation
/retest
The GPU test is passing but not the simple TFJob test.
It seems that the simple TFJob test times out.
The job has been deleted but we can still get it via
@jlewi I think we could merge this PR, then I could file a new PR to fix the simple TFJob test.
I think it is caused by https://github.com/kubeflow/tf-operator/blob/0759f7ae5142ed2e78a6971e9703fdc86b7307cd/py/tf_job_client.py#L64:28. We do not check the response of the delete request.
Interesting. Seems strange that it's only happening now and not with GPU jobs.
* A bug was introduced with getting the replica status in kubeflow#344, which switched to creating pods directly.
* Our presubmits/postsubmits were failing, but this went unnoticed because the git status check was improperly reported as succeeded.
* The bug is that we try to get the pod status by name, but the name doesn't include the random salt in the pod name.
* The code in question is a legacy of when we were using job controllers and we first got the status of the job controller. We incorrectly changed that code to get the pod. The correct thing is to just list pods by label; we already do that in the code below, so we just need to delete some code.
* Don't create any resources if the DeletionTimestamp is set. Creating resources at this point would end up blocking deletion of the object, because the controller would create resources while we are trying to delete them.
* Use logrus in controller.go, trainer.go, and replicas.go to log with fields providing information about the job and replica. This makes it easy to filter logs for a particular job.
* Use logrus to log the name of the job in a field.
/unassign @zjj2wry
Use the Phase to determine whether we should create resources.
@gaocegege Any idea what the lint failure means?
/retest
It seems that you already fixed the lint errors in Travis.
Regarding the most recent E2E failure: the test submits multiple TFJobs with the same name.
The logs indicate that between 21:00:12 and 21:07 (when the cluster is deleted because of the test-runner timeout) reconcile is never called for the TFJob.
My conjecture is that the problem is the rate-limiting work queue. We use the defaults, which use exponential backoff with a max delay per item of 1000 seconds. So every 30 seconds the informer is queuing an update event, but these are then being rate limited and processed with a max delay of 1000 seconds. The log also indicates that we call Forget on the workqueue only once, and that's for the GPU job. I think this is a bug: Forget causes the rate limiter to reset, so we should be calling it after every successful sync so that we don't run into the rate limits.
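For reference, a minimal sketch of the standard client-go worker loop that calls Forget after a successful sync; the Controller struct and syncHandler field here are illustrative stand-ins, not the actual tf-operator code:

```go
package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

// Controller is a minimal stand-in for the TFJob controller; only the queue
// handling relevant to the rate-limiter discussion is shown.
type Controller struct {
	queue       workqueue.RateLimitingInterface
	syncHandler func(key string) error
}

func (c *Controller) processNextWorkItem() bool {
	key, quit := c.queue.Get()
	if quit {
		return false
	}
	defer c.queue.Done(key)

	if err := c.syncHandler(key.(string)); err != nil {
		// On failure, requeue with exponential backoff (the default controller
		// rate limiter delays an item by up to 1000 seconds after repeated failures).
		c.queue.AddRateLimited(key)
		return true
	}
	// On success, reset the per-item backoff so the next event for this job is
	// processed immediately rather than being rate limited.
	c.queue.Forget(key)
	return true
}

func main() {
	c := &Controller{
		queue:       workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter()),
		syncHandler: func(key string) error { fmt.Println("sync", key); return nil },
	}
	c.queue.Add("default/my-tfjob")
	c.processNextWorkItem()
}
```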
* Otherwise the rate limiter will end up delaying the processing of subsequent events, which isn't what we want.
* Reconcile needs to update the TFJob stored in TrainingJob. This ensures TrainingJob has an up-to-date representation of the job.
* Otherwise changes made to the spec won't be available to TrainingJob. For example, if the job is deleted by the user, the deletion timestamp will be set. But if we don't update the TFJob stored in TrainingJob, this change won't be propagated.
Looks like the test passed. Let's make sure it's not a fluke.
/test all
@gaocegege PTAL
/lgtm
I think the test-and-forget logic is copied from the job controller and I think it is not suitable for TFJob.
/cc @ScorpioCPH
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: gaocegege
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Thanks for the fix!
* Fix bug with jobs not being marked as completed.
* A bug was introduced with getting the replica status in kubeflow#344, which switched to creating pods directly.
* Our presubmits/postsubmits were failing, but this went unnoticed because the git status check was improperly reported as succeeded.
* The bug is that we try to get the pod status by name, but the name doesn't include the random salt in the pod name.
* The code in question is a legacy of when we were using job controllers and we first got the status of the job controller. We incorrectly changed that code to get the pod. The correct thing is to just list pods by label; we already do that in the code below, so we just need to delete some code.
* Don't create any resources if the DeletionTimestamp is set. Creating resources at this point would end up blocking deletion of the object, because the controller would create resources while we are trying to delete them.
* Use logrus in controller.go, trainer.go, and replicas.go to log with fields providing information about the job and replica. This makes it easy to filter logs for a particular job.
* Use logrus to log the name of the job in a field.
* Checking the DeletionTimestamp doesn't appear to be sufficient. Use the Phase to determine whether we should create resources.
* Run gofmt.
* Reset the rate limiter after every successful sync. Otherwise the rate limiter will end up delaying the processing of subsequent events, which isn't what we want.
* Run goimports to fix lint issues.
* Reconcile needs to update the TFJob stored in TrainingJob. This ensures TrainingJob has an up-to-date representation of the job. Otherwise changes made to the spec won't be available to TrainingJob. For example, if the job is deleted by the user, the deletion timestamp will be set. But if we don't update the TFJob stored in TrainingJob, this change won't be propagated.
* TrainingJob.update should log the value of the job, not the pointer.
* Add more comments to the code.
Fix several bugs with the job controller.
A bug was introduced with getting the replica status in #344 (Create Pod instead of Job), which switched to creating pods directly.
Our presubmits/postsubmits were failing but this went unnoticed because
the git status check was improperly reported as succeeded.
One bug is that we try to get the pod status by name, but the name doesn't include the random salt in the pod name.
The code in question is a legacy of when we were using job controllers and
we first got the status of the job controller. We incorrectly changed that
code to get the pod. The correct thing is to just list pods by label; we
already do that in the code below so we just need to delete some code.
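As an aside, here is a minimal sketch of listing a replica's pods by label selector with client-go; the label keys (runtime_id, job_type) and the helper name are assumptions for illustration, not necessarily what the operator actually uses:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// listReplicaPods lists the pods belonging to one replica type of a TFJob by
// label selector rather than looking up a single pod by name, since generated
// pod names contain a random salt. The label keys are illustrative only.
func listReplicaPods(clientset kubernetes.Interface, namespace, runtimeID, replicaType string) error {
	selector := labels.Set{
		"runtime_id": runtimeID,
		"job_type":   replicaType,
	}.AsSelector().String()

	pods, err := clientset.CoreV1().Pods(namespace).List(context.TODO(), metav1.ListOptions{
		LabelSelector: selector,
	})
	if err != nil {
		return err
	}
	for _, p := range pods.Items {
		fmt.Printf("pod %s phase %s\n", p.Name, p.Status.Phase)
	}
	return nil
}

func main() {
	// Assumes the code runs inside a cluster; swap in a kubeconfig-based config otherwise.
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	if err := listReplicaPods(kubernetes.NewForConfigOrDie(config), "default", "some-runtime-id", "worker"); err != nil {
		panic(err)
	}
}
```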
A second bug is a problem with deleting resources.
Once a job is marked for deletion we shouldn't create any more resources; doing so would block deletion because we do foreground deletion, so the TFJob won't be deleted until all child resources have been deleted.
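For context, a foreground delete of a TFJob issued via the dynamic client looks roughly like the sketch below; the kubeconfig handling, namespace, and job name are illustrative, and the group/version/resource reflects the v1alpha1 API of this era:

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig; path and namespace are illustrative.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	dyn, err := dynamic.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	tfJobs := schema.GroupVersionResource{Group: "kubeflow.org", Version: "v1alpha1", Resource: "tfjobs"}

	// Foreground deletion: the TFJob object is not removed until all of its
	// children (pods, services) are gone, so the controller must stop creating
	// new children once the deletion timestamp is set.
	policy := metav1.DeletePropagationForeground
	err = dyn.Resource(tfJobs).Namespace("default").Delete(
		context.TODO(), "my-tfjob", metav1.DeleteOptions{PropagationPolicy: &policy})
	if err != nil {
		panic(err)
	}
}
```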
When the job is deleted, the DeletionTimestamp will be set on the object and we can use that inside the syncFunction.
But reconcile needs to update the TFJob stored inside TrainingJob so that we pick up changes to the TFJob made external to the operator, e.g. by the user issuing a delete request.
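A minimal sketch of the check described above, using a tiny stand-in struct for the TFJob; the Phase values and helper name are illustrative, not the operator's actual types:

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// tfJob is a tiny stand-in for the TFJob type; only the fields needed for the
// deletion check are included.
type tfJob struct {
	metav1.ObjectMeta
	Phase string // e.g. "Creating", "Running", "Done" -- illustrative values
}

// shouldCreateResources reports whether the sync loop may create pods/services
// for the job. Once the deletion timestamp is set (or the job has left its
// active phases), creating new children would only block foreground deletion.
func shouldCreateResources(j *tfJob) bool {
	if j.DeletionTimestamp != nil {
		return false
	}
	return j.Phase == "Creating" || j.Phase == "Running"
}

func main() {
	now := metav1.Now()
	deleted := &tfJob{ObjectMeta: metav1.ObjectMeta{DeletionTimestamp: &now}, Phase: "Running"}
	active := &tfJob{Phase: "Creating"}
	fmt.Println(shouldCreateResources(deleted)) // false
	fmt.Println(shouldCreateResources(active))  // true
}
```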
A third problem is that we weren't resetting the rate limiter on the work queue after a successful sync of a work item. Resetting it means that if we receive another event we can process it immediately.
Increase the timeout we wait for the job to finish.
Use logrus to log fields that provide useful metadata, such as the job a log entry goes with.
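A minimal sketch of logging with logrus fields; the field names and values here are illustrative, not the exact fields the operator emits:

```go
package main

import (
	log "github.com/sirupsen/logrus"
)

func main() {
	// Attach job metadata as structured fields so logs for a single TFJob can be
	// filtered easily; the field names are illustrative.
	logger := log.WithFields(log.Fields{
		"job":          "default/my-tfjob",
		"uid":          "1234-abcd",
		"replica_type": "worker",
	})
	logger.Info("reconciling job")
	logger.WithField("phase", "Done").Info("job finished")
}
```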
Fix E2E tests timing out; job appears to remain in running state even though job is done. #500