
E2E tests timing out; job appears to remain in running state even though job is done. #500

Closed
jlewi opened this issue Mar 23, 2018 · 12 comments


jlewi commented Mar 23, 2018

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/kubeflow_tf-operator/485/kubeflow-tf-operator-presubmit/246/

Timeout waiting for gpu-tfjob in namespace default to finish.

jlewi commented Mar 23, 2018

We can use the filter below to get logs for the TFJob controller.

resource.type="container"
resource.labels.cluster_name="zresubmit-tfjob-e2e-485-d23ba90-246-13ee"
resource.labels.container_name="tf-job-operator"


jlewi commented Mar 23, 2018

The TFJob controller is using the image
gcr.io/kubeflow-ci/tf_operator:kubeflow-tf-operator-presubmit-tfjob-e2e-485-d23ba90-246-13ee

So it's an image built at head, as expected.


jlewi commented Mar 23, 2018

Timeouts are occurring in our postsubmits.
https://k8s-testgrid.appspot.com/sig-big-data#kubeflow-tf-operator-postsubmit


jlewi commented Mar 23, 2018

This looks like the first time the postsubmit started failing:
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubeflow_tf-operator/kubeflow-tf-operator-postsubmit/25/

That run corresponds to the commit that changed us to creating pods instead of job controllers:
6706903

The presubmit had actually failed
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/kubeflow_tf-operator/344/kubeflow-tf-operator-presubmit/137/

but the status check was improperly reported as passing on GitHub.

jlewi changed the title from "E2E tests timing out" to "E2E tests timing out; job appears to remain in running state even though job is done." Mar 23, 2018

jlewi commented Mar 23, 2018

So my conjecture is that when we switched to pods we introduced a bug whereby we never transition from the Running state to the Done state.


jlewi commented Mar 23, 2018

Here's the bug.

GetSingleReplicaStatus gets the pod by name here:
https://github.com/kubeflow/tf-operator/blob/master/pkg/trainer/replicas.go#L334

But the name is incorrect; genName doesn't include the random salt appended to the pod name.

The code below it does the correct thing, i.e. it lists the pods for the replica index and then gets the status from the pod list. So we just need to delete some code.
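
A minimal sketch of the difference in plain client-go, run against core Pods (the pod name and label keys are illustrative, not the operator's actual ones, and the call signatures assume a recent client-go):

```go
// Pods created via GenerateName get a random suffix, so a Get() on the
// un-salted base name returns NotFound; listing by label finds the pod
// regardless of the suffix.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	// Buggy pattern: the real pod is e.g. "simple-tfjob-master-r5kf-0-u72dl",
	// so getting the un-salted name fails with NotFound.
	_, err = client.CoreV1().Pods("default").Get(ctx, "simple-tfjob-master-r5kf-0", metav1.GetOptions{})
	fmt.Println("get by generated name:", err)

	// Correct pattern: list pods by the labels stamped on each replica
	// (label keys below are illustrative) and read status from the list.
	pods, err := client.CoreV1().Pods("default").List(ctx, metav1.ListOptions{
		LabelSelector: "tf_job_name=simple-tfjob,task_index=0",
	})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		fmt.Printf("%s: %s\n", p.Name, p.Status.Phase)
	}
}
```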

jlewi added a commit to jlewi/k8s that referenced this issue Mar 23, 2018
* A bug was introduced with getting the replica status in kubeflow#344 which
switched to creating pods directly.

* Our presubmits/postsubmits were failing but this went unnoticed because
the git status check was improperly reported as succeeded.

* The bug is because we try to get the pod status by name but the name
doesn't include the random salt in the pod name.

* The code in question is a legacy of when we were using job controllers and
we first got the status of the job controller. We incorrectly changed that
code to get the pod. The correct thing is to just list pods by label; we
already do that in the code below so we just need to delete some code.

* Fix kubeflow#500

jlewi commented Mar 24, 2018

While trying to fix this, I started to notice problems similar to #322, in which we observed that recreating a job with the same name would cause the job to get stuck until we deleted the TFJob operator pod.


jlewi commented Mar 24, 2018

This latter problem appears to be an issue with how the controller caches information.

What is the proper way to make sure the controller is resilient to such failures? In particular, I would expect the controller to periodically list TFJobs and enqueue a work item for every TFJob that exists. But that doesn't seem to be happening, and we end up forgetting about some TFJobs in the system.

@gaocegege @ScorpioCPH any ideas?
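
For reference, a minimal sketch of the standard client-go pattern for this: give the shared informer a non-zero resync period so the whole cache is periodically replayed through the handlers and re-enqueued. This uses Pods as a stand-in for TFJobs and generic workqueue wiring, not the actual tf-operator controller code:

```go
// With a non-zero resync period the informer periodically replays its whole
// cache through UpdateFunc, so every existing object gets re-enqueued even
// if an event was missed or a work item was dropped.
package main

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/workqueue"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())

	// Resync every 30 seconds (illustrative value).
	factory := informers.NewSharedInformerFactory(client, 30*time.Second)
	informer := factory.Core().V1().Pods().Informer() // stand-in for a TFJob informer

	enqueue := func(obj interface{}) {
		if key, err := cache.MetaNamespaceKeyFunc(obj); err == nil {
			queue.Add(key)
		}
	}
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    enqueue,
		UpdateFunc: func(oldObj, newObj interface{}) { enqueue(newObj) }, // also fired on resync
	})

	stop := make(chan struct{})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)

	// A real controller would run workers here that pop keys off the queue
	// and call the sync/reconcile function for each one.
	select {}
}
```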


jlewi commented Mar 25, 2018

I think my previous comment about this being a caching problem is wrong.

Here's my conjecture:

  • We issue a delete for the TFJob.
  • We use foreground deletion (see the sketch after this list).
  • So deletion of the TFJob is blocked until its child resources are deleted.
  • While waiting for those resources to be deleted, the operator calls reconcile/SyncTFJob, recreating child resources as they are being deleted.
  • This ends up preventing the TFJob from being deleted unless K8s can delete the child resources faster than the operator can recreate them.
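
For context, this is roughly what such a foreground delete looks like expressed in Go with the dynamic client (the test actually issues it from py/tf_job_client.py; the group/version/resource values match the kubeflow.org/v1alpha1 TFJob shown in the log below):

```go
// Foreground deletion: the API server sets deletionTimestamp and the
// foregroundDeletion finalizer on the TFJob, deletes the children first,
// and only removes the TFJob itself once the children are gone.
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	tfjobs := schema.GroupVersionResource{
		Group: "kubeflow.org", Version: "v1alpha1", Resource: "tfjobs",
	}
	policy := metav1.DeletePropagationForeground
	err = client.Resource(tfjobs).Namespace("default").Delete(
		context.Background(), "simple-tfjob",
		metav1.DeleteOptions{PropagationPolicy: &policy},
	)
	if err != nil {
		panic(err)
	}
}
```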

Here are logs to support this

16:21 Creating non-existent tfjobs default.simple-tfjob
....
16:22 Job simple-tfjob in namespace default; uid=10df71e6-2fba-11e8-865b-42010a8e0039; phase=Done, state=Succeeded,
16:22 Deleting job default.simple-tfjob
16:22 "INFO|2018-03-24T23:22:23|py/tf_job_client.py|66| Deleting job default.simple-tfjob returned: {u'status': {u'phase': u'Done', u'reason': u'', u'replicaStatuses': [{u'tf_replica_type': u'MASTER', u'state': u'Succeeded', u'ReplicasStates': {u'Succeeded': 1}}, {u'tf_replica_type': u'WORKER', u'state': u'Running', u'ReplicasStates': {u'Running': 1}}, {u'tf_replica_type': u'PS', u'state': u'Running', u'ReplicasStates': {u'Running': 2}}], u'state': u'Succeeded'}, u'kind': u'TFJob', u'spec': {u'replicaSpecs': [{u'tfReplicaType': u'MASTER', u'tfPort': 2222, u'template': {u'spec': {u'restartPolicy': u'OnFailure', u'containers': [{u'image': u'gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff', u'name': u'tensorflow', u'resources': {}}]}, u'metadata': {u'creationTimestamp': None}}, u'replicas': 1}, {u'tfReplicaType': u'WORKER', u'tfPort': 2222, u'template': {u'spec': {u'restartPolicy': u'OnFailure', u'containers': [{u'image': u'gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff', u'name': u'tensorflow', u'resources': {}}]}, u'metadata': {u'creationTimestamp': None}}, u'replicas': 1}, {u'tfReplicaType': u'PS', u'tfPort': 2222, u'template': {u'spec': {u'restartPolicy': u'OnFailure', u'containers': [{u'image': u'gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff', u'name': u'tensorflow', u'resources': {}}]}, u'metadata': {u'creationTimestamp': None}}, u'replicas': 2}], u'terminationPolicy': {u'chief': {u'replicaIndex': 0, u'replicaName': u'MASTER'}}, u'RuntimeId': u'r5kf', u'tfImage': u'tensorflow/tensorflow:1.3.0'}, u'apiVersion': u'kubeflow.org/v1alpha1', u'metadata': {u'name': u'simple-tfjob', u'deletionTimestamp': u'2018-03-24T23:22:23Z', u'clusterName': u'', u'deletionGracePeriodSeconds': 0, u'namespace': u'default', u'generation': 0, u'finalizers': [u'foregroundDeletion'], u'resourceVersion': u'1035', u'creationTimestamp': u'2018-03-24T23:21:22Z', u'selfLink': u'/apis/kubeflow.org/v1alpha1/namespaces/default/tfjobs/simple-tfjob', u'uid': u'10df71e6-2fba-11e8-865b-42010a8e0039'}}
"

...
16:26 Job simple-tfjob in namespace default; uid=10df71e6-2fba-11e8-865b-42010a8e0039; phase=Done, state=Succeeded,
16:26 /mnt/test-data-volume/kubeflow-tf-operator-presubmit-tfjob-e2e-501-8ca5b96-255-978a/src/kubeflow/tf-operator/py/test_runner.py|228| Timeout waiting for simple-tfjob in namespace default to finish.

So the test timed out waiting for the job to be deleted.

tf-operator log

16:21 Creating pod: simple-tfjob-master-r5kf-0-u72dl

...

16:22:15
filename: "trainer/training.go:384"
level: "info"
msg: "Master succeeded Job: simple-tfjob."

16:22:23 Creating pod: simple-tfjob-master-r5kf-0-kpcv9

16:25
filename: "trainer/training.go:388"
level: "info"
msg: "Master running Job: simple-tfjob."

16:28
filename: "controller/controller.go:234"
job: "default/simple-tfjob"
level: "info"
msg: "Job has been deleted: default/simple-tfjob"

The TFJob operator logs show we end up recreating resources after the job has succeeded and after the job has been deleted.
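
If that's what is happening, one way to close the race would be a guard along these lines (a hypothetical sketch, not the actual tf-operator code): skip the recreation path for any job that already has a deletionTimestamp.

```go
// Hypothetical guard: a job with a non-nil deletionTimestamp is being torn
// down (foreground deletion keeps it visible until its children are gone),
// so reconcile should stop recreating child pods/services for it.
package trainer

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// shouldReconcile reports whether the reconcile loop may (re)create child
// resources for the given job object.
func shouldReconcile(job metav1.Object) bool {
	return job.GetDeletionTimestamp() == nil
}
```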
