
E2E tests timing out; job appears to remain in running state even though job is done. #500

Closed
jlewi opened this issue Mar 23, 2018 · 12 comments


jlewi commented Mar 23, 2018

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/kubeflow_tf-operator/485/kubeflow-tf-operator-presubmit/246/

Timeout waiting for gpu-tfjob in namespace default to finish.

jlewi commented Mar 23, 2018

We can use the filter below to get logs for the TFJob controller.

resource.type="container"
resource.labels.cluster_name="zresubmit-tfjob-e2e-485-d23ba90-246-13ee"
resource.labels.container_name="tf-job-operator"


jlewi commented Mar 23, 2018

The TFJob controller is using the image
gcr.io/kubeflow-ci/tf_operator:kubeflow-tf-operator-presubmit-tfjob-e2e-485-d23ba90-246-13ee

So it's an image built at head, as expected.


jlewi commented Mar 23, 2018

Timeouts are occurring in our postsubmits.
https://k8s-testgrid.appspot.com/sig-big-data#kubeflow-tf-operator-postsubmit


jlewi commented Mar 23, 2018

This looks like the first time the postsubmit started failing:
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubeflow_tf-operator/kubeflow-tf-operator-postsubmit/25/

That run corresponds to the commit that changed us to creating pods instead of job controllers:
6706903

The presubmit had actually failed
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/kubeflow_tf-operator/344/kubeflow-tf-operator-presubmit/137/

but the status check was improperly reported as passing on GitHub.

jlewi changed the title from "E2E tests timing out" to "E2E tests timing out; job appears to remain in running state even though job is done." Mar 23, 2018

jlewi commented Mar 23, 2018

So my conjecture is that when we switched to pods we introduced a bug whereby we never transition from the Running state to the Done state.


jlewi commented Mar 23, 2018

Here's the bug.

GetSingleReplicaStatus gets the pod by name here:
https://github.com/kubeflow/tf-operator/blob/master/pkg/trainer/replicas.go#L334

But the name is incorrect; genName doesn't include the random salt appended to the pod name.

The code below it does the correct thing, i.e. it lists the pods for the replica index and then gets the status from the pod list. So we just need to delete some code.
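
A minimal sketch of the difference in plain client-go, run against core Pods (the pod name and label keys are illustrative, not the operator's actual ones, and the call signatures assume a recent client-go):

```go
// Pods created via GenerateName get a random suffix, so a Get() on the
// un-salted base name returns NotFound; listing by label finds the pod
// regardless of the suffix.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	// Buggy pattern: the real pod is e.g. "simple-tfjob-master-r5kf-0-u72dl",
	// so getting the un-salted name fails with NotFound.
	_, err = client.CoreV1().Pods("default").Get(ctx, "simple-tfjob-master-r5kf-0", metav1.GetOptions{})
	fmt.Println("get by generated name:", err)

	// Correct pattern: list pods by the labels stamped on each replica
	// (label keys below are illustrative) and read status from the list.
	pods, err := client.CoreV1().Pods("default").List(ctx, metav1.ListOptions{
		LabelSelector: "tf_job_name=simple-tfjob,task_index=0",
	})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		fmt.Printf("%s: %s\n", p.Name, p.Status.Phase)
	}
}
```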

jlewi added a commit to jlewi/k8s that referenced this issue Mar 23, 2018
* A bug was introduced with getting the replica status in kubeflow#344 which
switched to creating pods directly.

* Our presubmits/postsubmits were failing but this went unnoticed because
the git status check was improperly reported as succeeded.

* The bug is because we try to get the pod status by name but the name
doesn't include the random salt in the pod name.

* The code in question is a legacy of when we were using job controllers and
we first got the status of the job controller. We incorrectly changed that
code to get the pod. The correct thing is to just list pods by label; we
already do that in the code below so we just need to delete some code.

* Fix kubeflow#500

jlewi commented Mar 24, 2018

While trying to fix this, I started to notice problems similar to #322, in which we observed that recreating a job with the same name would cause the job to get stuck until we deleted the TFJob operator pod.


jlewi commented Mar 24, 2018

This latter problem appears to be an issue with how the controller caches information.

What is the proper way to make sure the controller is resilient to such failures? In particular, I would expect the controller to periodically list TFJobs and enqueue a work item for every TFJob that exists. But that doesn't seem to be happening, and we end up forgetting about some TFJobs in the system.

@gaocegege @ScorpioCPH any ideas?
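
For reference, a minimal sketch of the standard client-go pattern for this: give the shared informer a non-zero resync period so the whole cache is periodically replayed through the handlers and re-enqueued. This uses Pods as a stand-in for TFJobs and generic workqueue wiring, not the actual tf-operator controller code:

```go
// With a non-zero resync period the informer periodically replays its whole
// cache through UpdateFunc, so every existing object gets re-enqueued even
// if an event was missed or a work item was dropped.
package main

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/workqueue"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())

	// Resync every 30 seconds (illustrative value).
	factory := informers.NewSharedInformerFactory(client, 30*time.Second)
	informer := factory.Core().V1().Pods().Informer() // stand-in for a TFJob informer

	enqueue := func(obj interface{}) {
		if key, err := cache.MetaNamespaceKeyFunc(obj); err == nil {
			queue.Add(key)
		}
	}
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    enqueue,
		UpdateFunc: func(oldObj, newObj interface{}) { enqueue(newObj) }, // also fired on resync
	})

	stop := make(chan struct{})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)

	// A real controller would run workers here that pop keys off the queue
	// and call the sync/reconcile function for each one.
	select {}
}
```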


jlewi commented Mar 25, 2018

I think my previous comment about this being a caching problem is wrong.

Here's my conjecture:

  • We issue a delete for the TFJob.
  • We use foreground deletion (see the sketch after this list).
  • So deletion of the TFJob is blocked until its child resources are deleted.
  • While waiting for those resources to be deleted, the operator calls reconcile/SyncTFJob, recreating child resources as they are being deleted.
  • This ends up preventing the TFJob from being deleted unless K8s can delete the child resources faster than the operator can recreate them.
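
For context, this is roughly what such a foreground delete looks like expressed in Go with the dynamic client (the test actually issues it from py/tf_job_client.py; the group/version/resource values match the kubeflow.org/v1alpha1 TFJob shown in the log below):

```go
// Foreground deletion: the API server sets deletionTimestamp and the
// foregroundDeletion finalizer on the TFJob, deletes the children first,
// and only removes the TFJob itself once the children are gone.
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	tfjobs := schema.GroupVersionResource{
		Group: "kubeflow.org", Version: "v1alpha1", Resource: "tfjobs",
	}
	policy := metav1.DeletePropagationForeground
	err = client.Resource(tfjobs).Namespace("default").Delete(
		context.Background(), "simple-tfjob",
		metav1.DeleteOptions{PropagationPolicy: &policy},
	)
	if err != nil {
		panic(err)
	}
}
```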

Here are logs to support this

16:21 Creating non-existent tfjobs default.simple-tfjob
....
16:22 Job simple-tfjob in namespace default; uid=10df71e6-2fba-11e8-865b-42010a8e0039; phase=Done, state=Succeeded,
16:22 Deleting job default.simple-tfjob
16:22 "INFO|2018-03-24T23:22:23|py/tf_job_client.py|66| Deleting job default.simple-tfjob returned: {u'status': {u'phase': u'Done', u'reason': u'', u'replicaStatuses': [{u'tf_replica_type': u'MASTER', u'state': u'Succeeded', u'ReplicasStates': {u'Succeeded': 1}}, {u'tf_replica_type': u'WORKER', u'state': u'Running', u'ReplicasStates': {u'Running': 1}}, {u'tf_replica_type': u'PS', u'state': u'Running', u'ReplicasStates': {u'Running': 2}}], u'state': u'Succeeded'}, u'kind': u'TFJob', u'spec': {u'replicaSpecs': [{u'tfReplicaType': u'MASTER', u'tfPort': 2222, u'template': {u'spec': {u'restartPolicy': u'OnFailure', u'containers': [{u'image': u'gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff', u'name': u'tensorflow', u'resources': {}}]}, u'metadata': {u'creationTimestamp': None}}, u'replicas': 1}, {u'tfReplicaType': u'WORKER', u'tfPort': 2222, u'template': {u'spec': {u'restartPolicy': u'OnFailure', u'containers': [{u'image': u'gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff', u'name': u'tensorflow', u'resources': {}}]}, u'metadata': {u'creationTimestamp': None}}, u'replicas': 1}, {u'tfReplicaType': u'PS', u'tfPort': 2222, u'template': {u'spec': {u'restartPolicy': u'OnFailure', u'containers': [{u'image': u'gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff', u'name': u'tensorflow', u'resources': {}}]}, u'metadata': {u'creationTimestamp': None}}, u'replicas': 2}], u'terminationPolicy': {u'chief': {u'replicaIndex': 0, u'replicaName': u'MASTER'}}, u'RuntimeId': u'r5kf', u'tfImage': u'tensorflow/tensorflow:1.3.0'}, u'apiVersion': u'kubeflow.org/v1alpha1', u'metadata': {u'name': u'simple-tfjob', u'deletionTimestamp': u'2018-03-24T23:22:23Z', u'clusterName': u'', u'deletionGracePeriodSeconds': 0, u'namespace': u'default', u'generation': 0, u'finalizers': [u'foregroundDeletion'], u'resourceVersion': u'1035', u'creationTimestamp': u'2018-03-24T23:21:22Z', u'selfLink': u'/apis/kubeflow.org/v1alpha1/namespaces/default/tfjobs/simple-tfjob', u'uid': u'10df71e6-2fba-11e8-865b-42010a8e0039'}}
"

...
16:26 Job simple-tfjob in namespace default; uid=10df71e6-2fba-11e8-865b-42010a8e0039; phase=Done, state=Succeeded,
16:26 /mnt/test-data-volume/kubeflow-tf-operator-presubmit-tfjob-e2e-501-8ca5b96-255-978a/src/kubeflow/tf-operator/py/test_runner.py|228| Timeout waiting for simple-tfjob in namespace default to finish.

So the test timed out waiting for the job to be deleted.

tf-operator log

16:21 Creating pod: simple-tfjob-master-r5kf-0-u72dl

...

16:22:15
filename: "trainer/training.go:384"
level: "info"
msg: "Master succeeded Job: simple-tfjob."

16:22:23 Creating pod: simple-tfjob-master-r5kf-0-kpcv9

16:25
filename: "trainer/training.go:388"
level: "info"
msg: "Master running Job: simple-tfjob."

16:28
filename: "controller/controller.go:234"
job: "default/simple-tfjob"
level: "info"
msg: "Job has been deleted: default/simple-tfjob"

The TFJob operator logs show we end up recreating resources after the job has succeeded and after the job has been deleted.
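
If that's what is happening, one way to close the race would be a guard along these lines (a hypothetical sketch, not the actual tf-operator code): skip the recreation path for any job that already has a deletionTimestamp.

```go
// Hypothetical guard: a job with a non-nil deletionTimestamp is being torn
// down (foreground deletion keeps it visible until its children are gone),
// so reconcile should stop recreating child pods/services for it.
package trainer

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// shouldReconcile reports whether the reconcile loop may (re)create child
// resources for the given job object.
func shouldReconcile(job metav1.Object) bool {
	return job.GetDeletionTimestamp() == nil
}
```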
