[Test Flake] tf_job_simple_test needs retries for ks init to deal with git connection issues #1128

Closed
jlewi opened this issue Jul 5, 2018 · 2 comments


jlewi commented Jul 5, 2018

See stack trace below. The test failed because ks init hit a connection error (connection reset by peer) while fetching from GitHub. We should add retries to deal with this.

ERROR|2018-07-05T06:16:46|test_helper.py:98| Subprocess failed;
level=info msg="Using context \"gke_kubeflow-ci_us-east1-d_e2e-834a\" from kubeconfig file \"/mnt/test-data-volume/kubeflow-presubmit-kubeflow-e2e-gke-1109-866b43a-2372-834a/.kube/config\""
level=error msg="unable to find SHA1 for repo: Get https://api.github.com/repos/ksonnet/parts/commits/master: read tcp 10.36.0.175:60958->192.30.253.116:443: read: connection reset by peer"
Traceback (most recent call last):
  File "/mnt/test-data-volume/kubeflow-presubmit-kubeflow-e2e-gke-1109-866b43a-2372-834a/src/kubeflow/testing/py/kubeflow/testing/test_helper.py", line 96, in wrap_test
    test_case.test_func(test_case)
  File "/mnt/test-data-volume/kubeflow-presubmit-kubeflow-e2e-gke-1109-866b43a-2372-834a/src/kubeflow/kubeflow/testing/tf_job_simple_test.py", line 72, in test_tf_job_simple
    util.run(["ks", "init", "tf-job-simple-app"])
  File "/mnt/test-data-volume/kubeflow-presubmit-kubeflow-e2e-gke-1109-866b43a-2372-834a/src/kubeflow/testing/py/kubeflow/testing/util.py", line 74, in run
    " ".join(command), process.returncode), "\n".join(output))
CalledProcessError: Command 'cmd: ks init tf-job-simple-app exited with code 1' returned non-zero exit status 1
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/mnt/test-data-volume/kubeflow-presubmit-kubeflow-e2e-gke-1109-866b43a-2372-834a/src/kubeflow/kubeflow/testing/tf_job_simple_test.py", line 99, in <module>
    test_suite.run()
  File "/mnt/test-data-volume/kubeflow-presubmit-kubeflow-e2e-gke-1109-866b43a-2372-834a/src/kubeflow/testing/py/kubeflow/testing/test_helper.py", line 73, in run
    wrap_test(test_case)
  File "/mnt/test-data-volume/kubeflow-presubmit-kubeflow-e2e-gke-1109-866b43a-2372-834a/src/kubeflow/testing/py/kubeflow/testing/test_helper.py", line 96, in wrap_test
    test_case.test_func(test_case)
  File "/mnt/test-data-volume/kubeflow-presubmit-kubeflow-e2e-gke-1109-866b43a-2372-834a/src/kubeflow/kubeflow/testing/tf_job_simple_test.py", line 72, in test_tf_job_simple
    util.run(["ks", "init", "tf-job-simple-app"])
  File "/mnt/test-data-volume/kubeflow-presubmit-kubeflow-e2e-gke-1109-866b43a-2372-834a/src/kubeflow/testing/py/kubeflow/testing/util.py", line 74, in run
    " ".join(command), process.returncode), "\n".join(output))
subprocess.CalledProcessError: Command 'cmd: ks init tf-job-simple-app exited with code 1' returned non-zero exit status 1
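
For illustration, the retry requested above could look roughly like the sketch below. It is not the code from the eventual fix: run_with_retries and its parameters are hypothetical names, it targets Python 3 rather than the Python 2.7 shown in the trace, and it assumes the failure surfaces as subprocess.CalledProcessError, as it does in the trace.

```python
import subprocess
import time


def run_with_retries(command, max_attempts=3, initial_delay=1.0, backoff=2.0,
                     run=subprocess.check_call, sleep=time.sleep):
    """Run `command`, retrying with exponential backoff on a non-zero exit.

    `run` and `sleep` are injectable (hypothetical hooks) so the helper can be
    exercised without starting real processes or actually waiting.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return run(command)
        except subprocess.CalledProcessError:
            if attempt == max_attempts:
                # Out of attempts: surface the last failure to the caller.
                raise
            # Wait before the next attempt, then lengthen the wait.
            sleep(delay)
            delay *= backoff
```

With this in place, the failing call in the test would become something like run_with_retries(["ks", "init", "tf-job-simple-app"]). A retry only helps if a partially created app directory from a failed attempt does not block the next one, so a real fix may need to clean up between attempts.
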

pdmack commented Jul 5, 2018

@jlewi I think we should consider adding --skip-default-registries to ks init to avoid hitting the ksonnet repo in the test workflows. These are ephemeral apps anyway, and it's not clear to me that fetching the ksonnet incubator SHA is actually necessary for launching Kubeflow. The ks "bits" are in the lib dir AFAICT.

DEBUG github: fetching SHA1 for ksonnet/parts - master
DEBUG github: fetching contents for ksonnet/parts/incubator/registry.yaml - 40285d8a14f1ac5787e405e1023cf0c07f6aa28c
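
As a rough sketch of what that would mean in the test, the ks init command line could be built like this. The helper name ks_init_command is hypothetical; only the --skip-default-registries flag and the app name come from this thread.

```python
def ks_init_command(app_name, skip_default_registries=True):
    """Build the argument list for `ks init`.

    When skip_default_registries is True, the flag suggested above is
    appended so that ks does not contact GitHub for the default registry.
    """
    command = ["ks", "init", app_name]
    if skip_default_registries:
        command.append("--skip-default-registries")
    return command
```

The test would then pass ks_init_command("tf-job-simple-app") to its existing util.run call in place of the hard-coded list.
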


jlewi commented Jul 6, 2018

Good idea.

jlewi added a commit to jlewi/kubeflow that referenced this issue Jul 6, 2018
…ksonnet app.

* Skip installing the default registries because we don't need them and talking
  to Git just creates a source of flakiness.

* Add retries to setting up the ksonnet app.

Fix kubeflow#1128
jlewi added a commit to jlewi/kubeflow that referenced this issue Jul 6, 2018
* Make the tf_job_simple test robust to test flakes due to problems initializing the ksonnet app.

* Skip installing the default registries because we don't need them and talking
  to Git just creates a source of flakiness.

* Add retries to setting up the ksonnet app.

Fix kubeflow#1128

* Fix flake in minikube test

wait_for_operation should just retry on socket errors.

Fix kubeflow#1137
k8s-ci-robot pushed a commit that referenced this issue Jul 7, 2018
…ksonnet app (#1133)

* Fix issues causing test flakes.

* Make the tf_job_simple test robust to test flakes due to problems initializing the ksonnet app.

* Skip installing the default registries because we don't need them and talking
  to Git just creates a source of flakiness.

* Add retries to setting up the ksonnet app.

Fix #1128

* Fix flake in minikube test

wait_for_operation should just retry on socket errors.

Fix #1137

* Fix incorrect import; sock -> socket.
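
A minimal sketch of the "retry on socket errors" idea from the commit message above, not the actual wait_for_operation from the test utilities: the signature, the get_operation callable, the OperationTimeout exception, and the "DONE" status value are all assumptions made for illustration. It is written for Python 3, where socket.error is an alias of OSError, so any OSError raised while polling is treated as transient.

```python
import socket
import time


class OperationTimeout(Exception):
    """Raised when the operation does not finish before the deadline."""


def wait_for_operation(get_operation, timeout_seconds=600,
                       polling_interval_seconds=5,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll `get_operation` until it reports completion.

    `get_operation` is a hypothetical zero-argument callable that returns a
    dict with a "status" key. Socket errors raised while polling are treated
    as transient: they are swallowed and the poll is repeated.
    """
    deadline = clock() + timeout_seconds
    while True:
        try:
            operation = get_operation()
            if operation.get("status") == "DONE":
                return operation
        except socket.error:
            # Transient network problem: fall through and poll again.
            pass
        if clock() >= deadline:
            raise OperationTimeout("Timed out waiting for operation to finish")
        sleep(polling_interval_seconds)
```

The timeout is checked outside the try block so that a persistent network failure still ends with OperationTimeout instead of looping forever.
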
saffaalvi pushed a commit to StatCan/kubeflow that referenced this issue Feb 11, 2021
surajkota pushed a commit to surajkota/kubeflow that referenced this issue Jun 13, 2022