openshift-sdn failed health check with "link not found" on GCP, failed test #18317

Closed
smarterclayton opened this issue Jan 28, 2018 · 3 comments · Fixed by #18355
Labels: component/networking, kind/bug, kind/test-flake, priority/P1, sig/networking

smarterclayton commented Jan 28, 2018

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/18229/test_pull_request_origin_extended_conformance_gce/15151/#conformanceareanetworkingfeaturerouter-openshift-routers-the-haproxy-router-should-serve-the-correct-routes-when-scoped-to-a-single-namespace-and-label-set-suiteopenshiftconformanceparallel

The exec pod was trying to curl the router (which was up) but wasn't able to create a connection.

/tmp/openshift/build-rpms/rpm/BUILD/origin-3.9.0/_output/local/go/src/github.com/openshift/origin/test/extended/router/scoped.go:42
Expected error:
    <*errors.errorString | 0xc420bf1170>: {
        s: "last response from server was not 200:\n",
    }
    last response from server was not 200:
    
not to have occurred
/tmp/openshift/build-rpms/rpm/BUILD/origin-3.9.0/_output/local/go/src/github.com/openshift/origin/test/extended/router/scoped.go:79

Router was up a few seconds after the exec pod was created, but it was never able to connect to the destination pod:

Jan 27 23:50:48.628: INFO: Running '/data/src/github.com/openshift/origin/_output/local/bin/linux/amd64/kubectl --server=https://internal-api.prtest-5a37c28-15151.origin-ci-int-gce.dev.rhcloud.com:8443 --kubeconfig=/tmp/cluster-admin.kubeconfig exec --namespace=extended-test-scoped-router-s9pmd-2sj74 execpod -- /bin/sh -c 
		set -e
		for i in $(seq 1 180); do
			code=$( curl -k -s -o /dev/null -w '%{http_code}\n' --header 'Host: 172.16.2.45' "http://172.16.2.45:1936/healthz" ) || rc=$?
			if [[ "${rc:-0}" -eq 0 ]]; then
				echo $code
				if [[ $code -eq 200 ]]; then
					exit 0
				fi
				if [[ $code -ne 503 ]]; then
					exit 1
				fi
			else
				echo "error ${rc}" 1>&2
			fi
			sleep 1
		done
		'
Jan 27 23:53:53.739: INFO: stderr: "error 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\n"
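
A note for reading the stderr above: curl exit code 7 means it could not connect to the host. The loop as logged also sets `rc` only when curl fails and never resets it, so after the first failed attempt every later iteration takes the `error` branch even if a later curl succeeds. Below is a minimal sketch of the same polling loop with `rc` reset on each attempt. The function name `wait_for_200` and the `TRIES`/`DELAY` variables are hypothetical, not part of the origin test code, and this sketch returns non-zero when the attempts run out.

```shell
# Hypothetical helper: run the probe command given as arguments until it
# prints HTTP status 200. TRIES and DELAY are assumed knobs for this sketch.
wait_for_200() (
  tries=${TRIES:-180}
  i=0
  while [ "$i" -lt "$tries" ]; do
    rc=0                      # reset per attempt so an old failure cannot stick
    code=$("$@") || rc=$?
    if [ "$rc" -eq 0 ]; then
      echo "$code"
      [ "$code" -eq 200 ] && exit 0   # healthy
      [ "$code" -ne 503 ] && exit 1   # unexpected status, give up
    else
      echo "error $rc" 1>&2           # e.g. 7: curl could not connect
    fi
    i=$((i + 1))
    sleep "${DELAY:-1}"
  done
  exit 1                              # attempts exhausted
)

# Example, using the probe from the log above:
#   wait_for_200 curl -k -s -o /dev/null -w '%{http_code}' \
#     --header 'Host: 172.16.2.45' "http://172.16.2.45:1936/healthz"
```

With the reset in place, a run that fails while the router is still starting and then succeeds would print a few `error 7` lines followed by `200`, rather than reporting the first failure on every iteration.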

Logs from the router:

Jan 27 23:53:53.828: INFO: Scoped Router test should serve the correct routes when scoped to a single namespace and label set [Suite:openshift/conformance/parallel] - waiting for the healthz endpoint to respond logs:
 I0127 23:51:07.732698       1 template.go:260] Starting template router (v3.9.0-alpha.3+5ec2b2f-218)
I0127 23:51:07.733031       1 merged_client_builder.go:123] Using in-cluster configuration
I0127 23:51:07.737366       1 reflector.go:202] Starting reflector *core.Service (10m0s) from github.com/openshift/origin/pkg/router/template/service_lookup.go:32
I0127 23:51:07.737420       1 reflector.go:240] Listing and watching *core.Service from github.com/openshift/origin/pkg/router/template/service_lookup.go:32
I0127 23:51:07.739146       1 router.go:154] Creating a new template router, writing to /var/lib/haproxy/router
I0127 23:51:07.739275       1 router.go:228] Template router will coalesce reloads within 5s of each other
I0127 23:51:07.739318       1 router.go:278] Router default cert from router container
I0127 23:51:07.739327       1 router.go:215] Reading persisted state
I0127 23:51:07.739355       1 router.go:219] Committing state
I0127 23:51:07.739363       1 router.go:333] Writing the router state
I0127 23:51:07.739855       1 router.go:340] Writing the router config
I0127 23:51:07.741933       1 router.go:354] Reloading the router
E0127 23:51:07.760463       1 reflector.go:205] github.com/openshift/origin/pkg/router/template/service_lookup.go:32: Failed to list *core.Service: services is forbidden: User "system:serviceaccount:extended-test-scoped-router-s9pmd-2sj74:default" cannot list services at the cluster scope: User "system:serviceaccount:extended-test-scoped-router-s9pmd-2sj74:default" cannot list all services in the cluster
I0127 23:51:07.816597       1 router.go:441] Router reloaded:
 - Checking http://localhost:80 ...
 - Health check ok : 0 retry attempt(s).

The OVS health check on the s4sd node failed immediately after the router started:

Jan 27 23:50:31 ci-prtest-5a37c28-15151-ig-n-s4sd origin-node[2074]: F0127 23:50:31.053104    2074 healthcheck.go:96] SDN healthcheck detected unhealthy OVS server, restarting: Link not found
Jan 27 23:50:31 ci-prtest-5a37c28-15151-ig-n-s4sd systemd[1]: origin-node.service: main process exited, code=exited, status=255/n/a
Jan 27 23:50:31 ci-prtest-5a37c28-15151-ig-n-s4sd systemd[1]: Unit origin-node.service entered failed state.
Jan 27 23:50:31 ci-prtest-5a37c28-15151-ig-n-s4sd systemd[1]: origin-node.service failed.
Jan 27 23:50:36 ci-prtest-5a37c28-15151-ig-n-s4sd systemd[1]: origin-node.service holdoff time over, scheduling restart.
Jan 27 23:50:36 ci-prtest-5a37c28-15151-ig-n-s4sd systemd[1]: Starting OpenShift Node...

@openshift/sig-networking @openshift/networking

smarterclayton added the kind/bug, component/networking, kind/test-flake, and sig/networking labels on Jan 28, 2018
smarterclayton changed the title from 'Connectivity never established between exec pod and another pod in namespace' to 'openshift-sdn failed health check with "link down"' on Jan 28, 2018
smarterclayton commented

Is this the health check being too aggressive, or a symptom of a bigger problem on the node?

smarterclayton changed the title from 'openshift-sdn failed health check with "link down"' to 'openshift-sdn failed health check with "link down" on GCP, failed test' on Jan 28, 2018
danwinship changed the title from 'openshift-sdn failed health check with "link down" on GCP, failed test' to 'openshift-sdn failed health check with "link not found" on GCP, failed test' on Feb 2, 2018
openshift-merge-robot added a commit that referenced this issue Feb 2, 2018
Automatic merge from submit-queue (batch tested with PRs 18376, 18355).

Move pod-namespace calls out of process

As discussed in #15991, we need to move all operations in the pod's network namespace out of process, due to a golang issue that allows setns() calls in a locked thread to leak into other threads, causing random lossage as operations intended for the main network namespace end up running in other namespaces instead. (This is fixed in golang 1.10 but we need a fix before then.)

Fixes #15991
Fixes #14385
Fixes #13108
Fixes #18317
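
The out-of-process idea in the commit message above can be illustrated from the shell side. This is only a sketch, not what PR #18355 implements: it assumes the util-linux `nsenter` tool is installed and that the caller knows a PID running inside the pod's network namespace. Because `nsenter` runs as a separate process, the setns() call happens there and the caller's own threads never switch namespaces. The function name `run_in_pod_netns` and the `NSENTER` override are hypothetical.

```shell
# Hypothetical helper: run a command inside the network namespace of PID $1
# by spawning nsenter(1), so setns() happens in the child, not in the caller.
# NSENTER is an assumed override (e.g. NSENTER=echo) for dry runs.
run_in_pod_netns() (
  pid=$1
  shift
  ${NSENTER:-nsenter} "--net=/proc/${pid}/ns/net" -- "$@"
)

# Example (needs root and a real PID):
#   run_in_pod_netns 12345 ip link show
```

The trade-off is one extra process per operation in exchange for keeping namespace switches out of the long-running daemon.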