E2E tests timing out; job appears to remain in running state even though job is done. #500
Comments
INFO|2018-03-23T19:50:43|py/tf_job_client.py|96| Job gpu-tfjob in namespace default; uid=d7987218-2ed2-11e8-98df-42010a8e0096; phase=Creating, state=Running, |
From those logs we are able to get the name of the cluster that the TFJob E2E test created. We can then fetch the event logs for that cluster and filter them to a specific TFJob using a log query.
The result is successfully printed out at 12:47. The logs in comment #1 indicate the job is still in the Running state at 12:50 as far as the job controller is concerned.
We can use a log filter to get the logs for the TFJob controller.
The TFJob controller is using an image built at head, as expected.
Timeouts are occurring in our postsubmits.
This looks like the first time the postsubmit started failing, which corresponds to the commit that changed us to creating pods instead of controllers. The presubmit had actually failed, but the git status check improperly reported it as passing.
So my conjecture is that when we switched to pods, we introduced a bug whereby we never transition from the Running state to a done state.
Here's the bug: GetSingleReplicaStatus is getting the pod by name, but the name is incorrect; genName doesn't include the random salt appended to the pod name. The code below it does the correct thing, i.e. it lists pods by index and then gets the status from the pod list. So we just need to delete some code.
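To make the distinction concrete, here is a minimal sketch of the two patterns, assuming a client-go version from around the time of this issue (Get/List calls without a context argument); the function names and the labelSelector parameter are illustrative, not the operator's actual code.

```go
package controller

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// Buggy pattern: fetch the pod by a generated name. genName lacks the random
// suffix Kubernetes appends when the pod is actually created, so this Get
// never finds the pod and the replica status never advances past Running.
func getReplicaPodByName(client kubernetes.Interface, ns, genName string) (*corev1.Pod, error) {
	return client.CoreV1().Pods(ns).Get(genName, metav1.GetOptions{})
}

// Working pattern (what the code further down already does): list the pods by
// the labels stamped on each replica and read the status off the returned list.
func listReplicaPods(client kubernetes.Interface, ns, labelSelector string) (*corev1.PodList, error) {
	return client.CoreV1().Pods(ns).List(metav1.ListOptions{LabelSelector: labelSelector})
}
```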
* A bug was introduced with getting the replica status in kubeflow#344, which switched to creating pods directly.
* Our presubmits/postsubmits were failing, but this went unnoticed because the git status check was improperly reported as succeeded.
* The bug is that we try to get the pod status by name, but the name doesn't include the random salt in the pod name.
* The code in question is a legacy of when we were using job controllers and first got the status of the job controller. We incorrectly changed that code to get the pod. The correct thing is to just list pods by label; we already do that in the code below, so we just need to delete some code.
* Fix kubeflow#500
While trying to fix this, I started to notice problems similar to #322, in which we observed that recreating a job with the same name would cause the job to get stuck until we deleted the TFJob operator pod.
This latter problem appears to be an issue with caching information about the controller. What is the proper way to make sure the controller is resilient to such failures? In particular, I would expect the controller to periodically list TFJobs and enqueue a work item for every TFJob that exists. But that doesn't seem to be happening, and we end up forgetting about some TFJobs in the system. @gaocegege @ScorpioCPH any ideas?
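For what it's worth, the usual way to get that behavior with client-go is a shared informer created with a non-zero resync period: on every resync the update handler fires for each cached object, so every object gets re-enqueued even if its last event was missed or mishandled. Below is a minimal sketch under that assumption, using the stock Pod informer for illustration (the generated TFJob informer would be wired the same way); startWatch and queue are hypothetical names, not code from this repo.

```go
package controller

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// startWatch wires an informer with a resync period so that every cached
// object is periodically re-delivered to UpdateFunc and re-enqueued, which
// keeps the controller from permanently forgetting an object.
func startWatch(client kubernetes.Interface, queue workqueue.RateLimitingInterface, stopCh <-chan struct{}) {
	// Resync every 30s: UpdateFunc then fires for every cached object on each
	// resync, even if nothing about the object changed.
	factory := informers.NewSharedInformerFactory(client, 30*time.Second)

	enqueue := func(obj interface{}) {
		// DeletionHandlingMetaNamespaceKeyFunc also copes with tombstones
		// delivered to DeleteFunc.
		if key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj); err == nil {
			queue.Add(key)
		}
	}

	informer := factory.Core().V1().Pods().Informer()
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    enqueue,
		UpdateFunc: func(_, newObj interface{}) { enqueue(newObj) },
		DeleteFunc: enqueue,
	})

	factory.Start(stopCh)
	cache.WaitForCacheSync(stopCh, informer.HasSynced)
}
```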
I think my previous comment about this being a caching problem is wrong. Here's my conjecture
Here are logs to support this:
* 16:21 Creating non-existent tfjobs default.simple-tfjob
* ...

So it timed out waiting for it to be deleted.

tf-operator log:
* 16:21 Creating pod: simple-tfjob-master-r5kf-0-u72dl
* ...
* 16:22:15
* 16:22:23 Creating pod: simple-tfjob-master-r5kf-0-kpcv9
* 16:25
* 16:28

The TFJob operator logs show we end up recreating resources after the job has succeeded and after the job has been deleted.
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/kubeflow_tf-operator/485/kubeflow-tf-operator-presubmit/246/