GPU test marked as succeeded but airflow step is failing #240

Closed
jlewi opened this issue Dec 22, 2017 · 2 comments · Fixed by #241

Comments

@jlewi
Contributor

jlewi commented Dec 22, 2017

In our Airflow graph, the run_gpu_test isn't succeeding.

[2017-12-22 11:29:47,048] {base_task_runner.py:98} INFO - Subtask: INFO:root:Loading spec from /var/lib/data/runs/tf_k8s_tests/2017-12-22T11_23_38/tensorflow_k8s/examples/tf_job_gpu.yaml with image_tag=notag
[2017-12-22 11:29:47,048] {base_task_runner.py:98} INFO - Subtask: ERROR:root:Exception when calling DefaultApi->apis_fqdn_v1_namespaces_namespace_resource_post. body: 404 page not found
[2017-12-22 11:29:47,049] {base_task_runner.py:98} INFO - Subtask: 
[2017-12-22 11:29:47,049] {base_task_runner.py:98} INFO - Subtask: INFO:root:Creationg gs://kubernetes-jenkins/logs/tf-k8s-periodic/174/artifacts/junit_gpu-tests.xml
[2017-12-22 11:29:47,049] {base_task_runner.py:98} INFO - Subtask: Traceback (most recent call last):
[2017-12-22 11:29:47,049] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/runpy.py", line 174, in _run_module_as_main
[2017-12-22 11:29:47,049] {base_task_runner.py:98} INFO - Subtask:     "__main__", fname, loader, pkg_name)
[2017-12-22 11:29:47,049] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/runpy.py", line 72, in _run_code
[2017-12-22 11:29:47,050] {base_task_runner.py:98} INFO - Subtask:     exec code in run_globals
[2017-12-22 11:29:47,050] {base_task_runner.py:98} INFO - Subtask:   File "/var/lib/data/runs/tf_k8s_tests/2017-12-22T11_23_38/tensorflow_k8s/py/test_runner.py", line 140, in <module>
[2017-12-22 11:29:47,050] {base_task_runner.py:98} INFO - Subtask:     main()
[2017-12-22 11:29:47,050] {base_task_runner.py:98} INFO - Subtask:   File "/var/lib/data/runs/tf_k8s_tests/2017-12-22T11_23_38/tensorflow_k8s/py/test_runner.py", line 137, in main
[2017-12-22 11:29:47,050] {base_task_runner.py:98} INFO - Subtask:     args.func(args)
[2017-12-22 11:29:47,051] {base_task_runner.py:98} INFO - Subtask:   File "/var/lib/data/runs/tf_k8s_tests/2017-12-22T11_23_38/tensorflow_k8s/py/test_runner.py", line 48, in run_test
[2017-12-22 11:29:47,051] {base_task_runner.py:98} INFO - Subtask:     api_response = tf_job_client.create_tf_job(api_client, spec)
[2017-12-22 11:29:47,051] {base_task_runner.py:98} INFO - Subtask:   File "/var/lib/data/runs/tf_k8s_tests/2017-12-22T11_23_38/tensorflow_k8s/py/tf_job_client.py", line 39, in create_tf_job
[2017-12-22 11:29:47,051] {base_task_runner.py:98} INFO - Subtask:     body = json.loads(e.body)
[2017-12-22 11:29:47,052] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/json/__init__.py", line 339, in loads
[2017-12-22 11:29:47,052] {base_task_runner.py:98} INFO - Subtask:     return _default_decoder.decode(s)
[2017-12-22 11:29:47,052] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/json/decoder.py", line 367, in decode
[2017-12-22 11:29:47,052] {base_task_runner.py:98} INFO - Subtask:     raise ValueError(errmsg("Extra data", s, end, len(s)))
[2017-12-22 11:29:47,052] {base_task_runner.py:98} INFO - Subtask: ValueError: Extra data: line 1 column 5 - line 2 column 1 (char 4 - 19)

But the test is still reported as succeeded in Gubernator. The junit file
https://storage.googleapis.com/kubernetes-jenkins/logs/tf-k8s-periodic/174/artifacts/junit_gpu-tests.xml
exists and reports success.

So there are two issues here:

  1. The TfJob isn't successfully created.
  2. The failure isn't properly reported (see the sketch below).
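
For what it's worth, here is a minimal sketch of the kind of guard that would keep the json.loads call in tf_job_client.create_tf_job from masking the underlying error. The helper name parse_error_body is made up for illustration and is not the actual code. When the TfJob CRD doesn't exist yet, the API server answers with the plain-text body 404 page not found, so json.loads raises the ValueError shown in the traceback instead of surfacing the 404:

import json


def parse_error_body(body):
  """Parse an API error body that may be plain text rather than JSON."""
  try:
    parsed = json.loads(body)
  except ValueError:
    # Not JSON (e.g. the plain-text "404 page not found" above); return the
    # raw text so the real error isn't masked by a decoding failure.
    return body
  if isinstance(parsed, dict):
    # Structured Kubernetes API errors carry a human-readable "message".
    return parsed.get("message", body)
  return body

With something like this, create_tf_job could log the parsed message and re-raise the original exception, giving the test runner a chance to record a failure in the junit file instead of reporting success.
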
@jlewi
Contributor Author

jlewi commented Dec 22, 2017

Here's the content of /var/lib/data/runs/tf_k8s_tests/2017-12-22T11_23_38/tensorflow_k8s/examples/tf_job_gpu.yaml:

apiVersion: "tensorflow.org/v1alpha1"
kind: "TfJob"
metadata:
  name: "tf-smoke-gpu"
spec:
  replicaSpecs:
    - tfReplicaType: MASTER
      template:
        spec:
          containers:
            - image: gcr.io/tf-on-k8s-dogfood/tf_sample_gpu:dc944ff
              name: tensorflow
              resources:
                limits:
                  nvidia.com/gpu: 1
          restartPolicy: OnFailure

@jlewi
Contributor Author

jlewi commented Dec 22, 2017

When I manually ran the step to create the cluster, I noticed that it completed before the controller pod had started:

LAST DEPLOYED: Fri Dec 22 05:58:28 2017
NAMESPACE: default
STATUS: DEPLOYED

RESOURCES:
==> v1/ServiceAccount
NAME             SECRETS  AGE
tf-job-operator  1        2s

==> v1beta1/ClusterRole
NAME             AGE
tf-job-operator  2s

==> v1beta1/ClusterRoleBinding
NAME             AGE
tf-job-operator  2s

==> v1beta1/Deployment
NAME             DESIRED  CURRENT  UP-TO-DATE  AVAILABLE  AGE
tf-job-operator  1        1        1           0          2s

==> v1/Pod(related)
NAME                             READY  STATUS             RESTARTS  AGE
tf-job-operator-ccdd9c97d-wsvsh  0/1    ContainerCreating  0         2s

==> v1/ConfigMap
NAME                    DATA  AGE
tf-job-operator-config  1     2s

So I'm guessing we try to create the TfJob before the CRD has actually been created, and that's why we get a 404.

Potential fixes

  • The setup cluster step should wait for the TfJob CRD to be created
  • Add retry to job submission (a rough sketch follows below)
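
For the retry option, something along these lines could wrap the submission. This is a rough sketch only: the helper name and parameters are made up, and it assumes the missing CRD surfaces as an ApiException with status 404, which may differ in the client the test runner actually uses.

import logging
import time

from kubernetes.client.rest import ApiException


def create_tf_job_with_retries(create_fn, retries=5, delay_seconds=10):
  """Call create_fn(), retrying while the TfJob endpoint returns 404.

  Covers the window between `helm install` returning and the tf-job-operator
  registering the TfJob CRD.
  """
  for attempt in range(retries):
    try:
      return create_fn()
    except ApiException as e:
      if e.status != 404 or attempt == retries - 1:
        raise
      logging.warning("TfJob endpoint not ready (404); retrying in %ss",
                      delay_seconds)
      time.sleep(delay_seconds)

It would be invoked as, e.g., create_tf_job_with_retries(lambda: tf_job_client.create_tf_job(api_client, spec)). The other option, waiting in the setup step until the tf-job-operator deployment reports at least one available replica, would address the same race at the source.
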

jlewi added a commit that referenced this issue Dec 22, 2017
The setup cluster step should wait for the TfJob operator deployment to
be ready.

Ensure that all exceptions result in a failure message being reported to Gubernator.

Upgrade and fix issues with Kubernetes py client 4.0.0; Fixes #242

Bugs with gpu_test Fix #240
jlewi added a commit to jlewi/k8s that referenced this issue Feb 28, 2018
* We need to stop pinning GKE version to 1.8.5 because that is no longer
  a valid version.

* We should no longer need to pin because 1.8 is now the default.

* Fix some lint issues that seem to have crept in.

Fix kubeflow#240
jlewi added a commit that referenced this issue Feb 28, 2018
* We need to stop pinning GKE version to 1.8.5 because that is no longer
  a valid version.

* We should no longer need to pin because 1.8 is now the default.

* Fix some lint issues that seem to have crept in.

Fix #240