GPU test marked as succeeded but airflow step is failing #240

Closed
jlewi opened this issue Dec 22, 2017 · 2 comments · Fixed by #241

Comments

@jlewi
Contributor

jlewi commented Dec 22, 2017

In our Airflow graph, the run_gpu_test isn't succeeding.

[2017-12-22 11:29:47,048] {base_task_runner.py:98} INFO - Subtask: INFO:root:Loading spec from /var/lib/data/runs/tf_k8s_tests/2017-12-22T11_23_38/tensorflow_k8s/examples/tf_job_gpu.yaml with image_tag=notag
[2017-12-22 11:29:47,048] {base_task_runner.py:98} INFO - Subtask: ERROR:root:Exception when calling DefaultApi->apis_fqdn_v1_namespaces_namespace_resource_post. body: 404 page not found
[2017-12-22 11:29:47,049] {base_task_runner.py:98} INFO - Subtask: 
[2017-12-22 11:29:47,049] {base_task_runner.py:98} INFO - Subtask: INFO:root:Creationg gs://kubernetes-jenkins/logs/tf-k8s-periodic/174/artifacts/junit_gpu-tests.xml
[2017-12-22 11:29:47,049] {base_task_runner.py:98} INFO - Subtask: Traceback (most recent call last):
[2017-12-22 11:29:47,049] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/runpy.py", line 174, in _run_module_as_main
[2017-12-22 11:29:47,049] {base_task_runner.py:98} INFO - Subtask:     "__main__", fname, loader, pkg_name)
[2017-12-22 11:29:47,049] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/runpy.py", line 72, in _run_code
[2017-12-22 11:29:47,050] {base_task_runner.py:98} INFO - Subtask:     exec code in run_globals
[2017-12-22 11:29:47,050] {base_task_runner.py:98} INFO - Subtask:   File "/var/lib/data/runs/tf_k8s_tests/2017-12-22T11_23_38/tensorflow_k8s/py/test_runner.py", line 140, in <module>
[2017-12-22 11:29:47,050] {base_task_runner.py:98} INFO - Subtask:     main()
[2017-12-22 11:29:47,050] {base_task_runner.py:98} INFO - Subtask:   File "/var/lib/data/runs/tf_k8s_tests/2017-12-22T11_23_38/tensorflow_k8s/py/test_runner.py", line 137, in main
[2017-12-22 11:29:47,050] {base_task_runner.py:98} INFO - Subtask:     args.func(args)
[2017-12-22 11:29:47,051] {base_task_runner.py:98} INFO - Subtask:   File "/var/lib/data/runs/tf_k8s_tests/2017-12-22T11_23_38/tensorflow_k8s/py/test_runner.py", line 48, in run_test
[2017-12-22 11:29:47,051] {base_task_runner.py:98} INFO - Subtask:     api_response = tf_job_client.create_tf_job(api_client, spec)
[2017-12-22 11:29:47,051] {base_task_runner.py:98} INFO - Subtask:   File "/var/lib/data/runs/tf_k8s_tests/2017-12-22T11_23_38/tensorflow_k8s/py/tf_job_client.py", line 39, in create_tf_job
[2017-12-22 11:29:47,051] {base_task_runner.py:98} INFO - Subtask:     body = json.loads(e.body)
[2017-12-22 11:29:47,052] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/json/__init__.py", line 339, in loads
[2017-12-22 11:29:47,052] {base_task_runner.py:98} INFO - Subtask:     return _default_decoder.decode(s)
[2017-12-22 11:29:47,052] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/json/decoder.py", line 367, in decode
[2017-12-22 11:29:47,052] {base_task_runner.py:98} INFO - Subtask:     raise ValueError(errmsg("Extra data", s, end, len(s)))
[2017-12-22 11:29:47,052] {base_task_runner.py:98} INFO - Subtask: ValueError: Extra data: line 1 column 5 - line 2 column 1 (char 4 - 19)

But the test is still reported as succeeded in Gubernator. The junit file
https://storage.googleapis.com/kubernetes-jenkins/logs/tf-k8s-periodic/174/artifacts/junit_gpu-tests.xml
exists and reports success.

So there are two issues here:

  1. The TfJob isn't successfully created.
  2. The failure isn't properly reported (see the sketch below).
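
For what it's worth, here is a minimal sketch of the kind of guard that would keep the json.loads call in tf_job_client.create_tf_job from masking the underlying error. The helper name parse_error_body is made up for illustration and is not the actual code. When the TfJob CRD doesn't exist yet, the API server answers with the plain-text body 404 page not found, so json.loads raises the ValueError shown in the traceback instead of surfacing the 404:

import json


def parse_error_body(body):
  """Parse an API error body that may be plain text rather than JSON."""
  try:
    parsed = json.loads(body)
  except ValueError:
    # Not JSON (e.g. the plain-text "404 page not found" above); return the
    # raw text so the real error isn't masked by a decoding failure.
    return body
  if isinstance(parsed, dict):
    # Structured Kubernetes API errors carry a human-readable "message".
    return parsed.get("message", body)
  return body

With something like this, create_tf_job could log the parsed message and re-raise the original exception, giving the test runner a chance to record a failure in the junit file instead of reporting success.
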
@jlewi
Contributor Author

jlewi commented Dec 22, 2017

Here's the content of /var/lib/data/runs/tf_k8s_tests/2017-12-22T11_23_38/tensorflow_k8s/examples/tf_job_gpu.yaml:

apiVersion: "tensorflow.org/v1alpha1"
kind: "TfJob"
metadata:
  name: "tf-smoke-gpu"
spec:
  replicaSpecs:
    - tfReplicaType: MASTER
      template:
        spec:
          containers:
            - image: gcr.io/tf-on-k8s-dogfood/tf_sample_gpu:dc944ff
              name: tensorflow
              resources:
                limits:
                  nvidia.com/gpu: 1
          restartPolicy: OnFailure

@jlewi
Contributor Author

jlewi commented Dec 22, 2017

When I manually ran the step to create the cluster, I noticed that it completed before the controller pod had started:

LAST DEPLOYED: Fri Dec 22 05:58:28 2017
NAMESPACE: default
STATUS: DEPLOYED

RESOURCES:
==> v1/ServiceAccount
NAME             SECRETS  AGE
tf-job-operator  1        2s

==> v1beta1/ClusterRole
NAME             AGE
tf-job-operator  2s

==> v1beta1/ClusterRoleBinding
NAME             AGE
tf-job-operator  2s

==> v1beta1/Deployment
NAME             DESIRED  CURRENT  UP-TO-DATE  AVAILABLE  AGE
tf-job-operator  1        1        1           0          2s

==> v1/Pod(related)
NAME                             READY  STATUS             RESTARTS  AGE
tf-job-operator-ccdd9c97d-wsvsh  0/1    ContainerCreating  0         2s

==> v1/ConfigMap
NAME                    DATA  AGE
tf-job-operator-config  1     2s

So I'm guessing we try to create the TfJob before the CRD has actually been created, and that's why we get a 404.

Potential fixes

  • The setup cluster step should wait for the TfJob CRD to be created
  • Add retry to job submission (a rough sketch follows below)
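
For the retry option, something along these lines could wrap the submission. This is a rough sketch only: the helper name and parameters are made up, and it assumes the missing CRD surfaces as an ApiException with status 404, which may differ in the client the test runner actually uses.

import logging
import time

from kubernetes.client.rest import ApiException


def create_tf_job_with_retries(create_fn, retries=5, delay_seconds=10):
  """Call create_fn(), retrying while the TfJob endpoint returns 404.

  Covers the window between `helm install` returning and the tf-job-operator
  registering the TfJob CRD.
  """
  for attempt in range(retries):
    try:
      return create_fn()
    except ApiException as e:
      if e.status != 404 or attempt == retries - 1:
        raise
      logging.warning("TfJob endpoint not ready (404); retrying in %ss",
                      delay_seconds)
      time.sleep(delay_seconds)

It would be invoked as, e.g., create_tf_job_with_retries(lambda: tf_job_client.create_tf_job(api_client, spec)). The other option, waiting in the setup step until the tf-job-operator deployment reports at least one available replica, would address the same race at the source.
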

jlewi added a commit that referenced this issue Dec 22, 2017
The setup cluster step should wait for the TfJob operator deployment to
be ready.

Ensure that all exceptions result in a failure message being reported to Gubernator.

Upgrade and fix issues with Kubernetes py client 4.0.0; Fixes #242

Bugs with gpu_test Fix #240
jlewi added a commit to jlewi/k8s that referenced this issue Feb 28, 2018
* We need to stop pinning GKE version to 1.8.5 because that is no longer
  a valid version.

* We should no longer need to pin because 1.8 is now the default.

* Fix some lint issues that seem to have crept in.

Fix kubeflow#240
jlewi added a commit that referenced this issue Feb 28, 2018
* We need to stop pinning GKE version to 1.8.5 because that is no longer
  a valid version.

* We should no longer need to pin because 1.8 is now the default.

* Fix some lint issues that seem to have crept in.

Fix #240