openshift-sdn failed health check with "link not found" on GCP, failed test #18317

smarterclayton · 2018-01-28T00:49:21Z

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/18229/test_pull_request_origin_extended_conformance_gce/15151/#conformanceareanetworkingfeaturerouter-openshift-routers-the-haproxy-router-should-serve-the-correct-routes-when-scoped-to-a-single-namespace-and-label-set-suiteopenshiftconformanceparallel

The exec pod was trying to curl the router (which was up) but wasn't able to establish a connection.

/tmp/openshift/build-rpms/rpm/BUILD/origin-3.9.0/_output/local/go/src/github.com/openshift/origin/test/extended/router/scoped.go:42
Expected error:
    <*errors.errorString | 0xc420bf1170>: {
        s: "last response from server was not 200:\n",
    }
    last response from server was not 200:
    
not to have occurred
/tmp/openshift/build-rpms/rpm/BUILD/origin-3.9.0/_output/local/go/src/github.com/openshift/origin/test/extended/router/scoped.go:79

The router was up a few seconds after the exec pod was created, but the exec pod was never able to connect to the destination pod:

Jan 27 23:50:48.628: INFO: Running '/data/src/github.com/openshift/origin/_output/local/bin/linux/amd64/kubectl --server=https://internal-api.prtest-5a37c28-15151.origin-ci-int-gce.dev.rhcloud.com:8443 --kubeconfig=/tmp/cluster-admin.kubeconfig exec --namespace=extended-test-scoped-router-s9pmd-2sj74 execpod -- /bin/sh -c 
		set -e
		for i in $(seq 1 180); do
			code=$( curl -k -s -o /dev/null -w '%{http_code}\n' --header 'Host: 172.16.2.45' "http://172.16.2.45:1936/healthz" ) || rc=$?
			if [[ "${rc:-0}" -eq 0 ]]; then
				echo $code
				if [[ $code -eq 200 ]]; then
					exit 0
				fi
				if [[ $code -ne 503 ]]; then
					exit 1
				fi
			else
				echo "error ${rc}" 1>&2
			fi
			sleep 1
		done
		'
Jan 27 23:53:53.739: INFO: stderr: "error 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\nerror 7\n"
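For reference, curl exit status 7 means curl failed to connect to the host, so each attempt logged above failed at the connection level rather than returning a non-200 HTTP status. A small helper along these lines could make the loop's stderr self-explanatory. This is only a sketch: `curl_exit_meaning` is a hypothetical name, not part of the test suite, and the meanings are those listed in the EXIT CODES section of curl(1).

```shell
# Map a few common curl exit codes to their documented meaning.
# curl_exit_meaning is a hypothetical helper used only for illustration.
curl_exit_meaning() {
  case "$1" in
    0)  echo "ok" ;;
    6)  echo "could not resolve host" ;;
    7)  echo "failed to connect to host" ;;
    28) echo "operation timed out" ;;
    *)  echo "curl exit code $1 (see the EXIT CODES section of curl(1))" ;;
  esac
}

curl_exit_meaning 7
```

In the test loop, the `else` branch could then print `error ${rc}: $(curl_exit_meaning "${rc}")` instead of the bare code.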

Logs from the router:

Jan 27 23:53:53.828: INFO: Scoped Router test should serve the correct routes when scoped to a single namespace and label set [Suite:openshift/conformance/parallel] - waiting for the healthz endpoint to respond logs:
 I0127 23:51:07.732698       1 template.go:260] Starting template router (v3.9.0-alpha.3+5ec2b2f-218)
I0127 23:51:07.733031       1 merged_client_builder.go:123] Using in-cluster configuration
I0127 23:51:07.737366       1 reflector.go:202] Starting reflector *core.Service (10m0s) from github.com/openshift/origin/pkg/router/template/service_lookup.go:32
I0127 23:51:07.737420       1 reflector.go:240] Listing and watching *core.Service from github.com/openshift/origin/pkg/router/template/service_lookup.go:32
I0127 23:51:07.739146       1 router.go:154] Creating a new template router, writing to /var/lib/haproxy/router
I0127 23:51:07.739275       1 router.go:228] Template router will coalesce reloads within 5s of each other
I0127 23:51:07.739318       1 router.go:278] Router default cert from router container
I0127 23:51:07.739327       1 router.go:215] Reading persisted state
I0127 23:51:07.739355       1 router.go:219] Committing state
I0127 23:51:07.739363       1 router.go:333] Writing the router state
I0127 23:51:07.739855       1 router.go:340] Writing the router config
I0127 23:51:07.741933       1 router.go:354] Reloading the router
E0127 23:51:07.760463       1 reflector.go:205] github.com/openshift/origin/pkg/router/template/service_lookup.go:32: Failed to list *core.Service: services is forbidden: User "system:serviceaccount:extended-test-scoped-router-s9pmd-2sj74:default" cannot list services at the cluster scope: User "system:serviceaccount:extended-test-scoped-router-s9pmd-2sj74:default" cannot list all services in the cluster
I0127 23:51:07.816597       1 router.go:441] Router reloaded:
 - Checking http://localhost:80 ...
 - Health check ok : 0 retry attempt(s).

The OVS health check on the s4sd node failed shortly before the router started (23:50:31, versus 23:51:07 for the router):

Jan 27 23:50:31 ci-prtest-5a37c28-15151-ig-n-s4sd origin-node[2074]: F0127 23:50:31.053104    2074 healthcheck.go:96] SDN healthcheck detected unhealthy OVS server, restarting: Link not found
Jan 27 23:50:31 ci-prtest-5a37c28-15151-ig-n-s4sd systemd[1]: origin-node.service: main process exited, code=exited, status=255/n/a
Jan 27 23:50:31 ci-prtest-5a37c28-15151-ig-n-s4sd systemd[1]: Unit origin-node.service entered failed state.
Jan 27 23:50:31 ci-prtest-5a37c28-15151-ig-n-s4sd systemd[1]: origin-node.service failed.
Jan 27 23:50:36 ci-prtest-5a37c28-15151-ig-n-s4sd systemd[1]: origin-node.service holdoff time over, scheduling restart.
Jan 27 23:50:36 ci-prtest-5a37c28-15151-ig-n-s4sd systemd[1]: Starting OpenShift Node...
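To check whether other nodes hit the same restart, one could filter each node's journal for the fatal healthcheck message. A minimal sketch, assuming the journal text is supplied on stdin (for example from `journalctl -u origin-node`); `sdn_healthcheck_failures` is a hypothetical helper name:

```shell
# Print the timestamp and node name for every fatal SDN healthcheck line on stdin.
# Assumes the syslog-style prefix seen above: "Mon DD HH:MM:SS hostname ...".
sdn_healthcheck_failures() {
  grep -F 'SDN healthcheck detected unhealthy OVS server' |
    awk '{ print $1, $2, $3, $4 }'
}
```

Usage on a node would look like `journalctl -u origin-node | sdn_healthcheck_failures`.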

@openshift/sig-networking @openshift/networking


smarterclayton · 2018-01-28T00:53:19Z

Is this the health check being too aggressive, or a symptom of a bigger problem on the node?

smarterclayton · 2018-01-30T01:26:05Z

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/18331/test_pull_request_origin_extended_conformance_gce/15216/

on node ci-prtest-5a37c28-15216-ig-n-t1xb

smarterclayton · 2018-01-30T01:29:20Z

Moved to https://bugzilla.redhat.com/show_bug.cgi?id=1539987

Automatic merge from submit-queue (batch tested with PRs 18376, 18355).

Move pod-namespace calls out of process

As discussed in #15991, we need to move all operations in the pod's network namespace out of process, due to a golang issue that allows setns() calls in a locked thread to leak into other threads, causing random lossage as operations intended for the main network namespace end up running in other namespaces instead. (This is fixed in golang 1.10 but we need a fix before then.)

Fixes #15991
Fixes #14385
Fixes #13108
Fixes #18317
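As a rough illustration of what "out of process" buys here: when a namespace switch happens in a separate child process, it cannot affect any thread of the long-running daemon that launched it. The sketch below uses `nsenter` from util-linux to run a command in another process's network namespace. It is not the mechanism used by the PR, which this thread does not show; `in_pod_netns`, the `DRY_RUN` switch, and the PID are hypothetical.

```shell
# Run a command inside the network namespace of process $1, in a child process.
# With DRY_RUN=1 the command line is printed instead of executed
# (actually entering another namespace requires root).
in_pod_netns() {
  pid=$1; shift
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "nsenter --target $pid --net -- $*"
  else
    nsenter --target "$pid" --net -- "$@"
  fi
}

# 1234 is a placeholder PID for a process in the pod's namespace.
DRY_RUN=1 in_pod_netns 1234 ip link show
```

Because the namespace change lives and dies with the child process, the caller stays in its original network namespace regardless of how its own threads are scheduled.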

smarterclayton added kind/bug Categorizes issue or PR as related to a bug. component/networking kind/test-flake Categorizes issue or PR as related to test flakes. sig/networking labels Jan 28, 2018

smarterclayton assigned knobunc Jan 28, 2018

smarterclayton changed the title ~~Connectivity never established between exec pod and another pod in namespace~~ openshift-sdn failed health check with "link down" Jan 28, 2018

smarterclayton added the priority/P1 label Jan 28, 2018

smarterclayton changed the title ~~openshift-sdn failed health check with "link down"~~ openshift-sdn failed health check with "link down" on GCP, failed test Jan 28, 2018

smarterclayton mentioned this issue Jan 29, 2018

[release-3.8] Prometheus scrape is 60s, so ensure we see at least one #18321

Merged

jsafrane mentioned this issue Feb 1, 2018

UPSTREAM: 57967: Fixed TearDown of NFS with root squash. #18154

Merged

danwinship mentioned this issue Feb 1, 2018

Move pod-namespace calls out of process #18355

Merged

This was referenced Feb 2, 2018

Prometheus when installed to the cluster should start and expose a secured proxy and unsecured metrics #17529

Closed

The HAProxy router should expose prometheus metrics for a route" flakes with error "timed out waiting for the condition" #17731

Closed

danwinship changed the title ~~openshift-sdn failed health check with "link down" on GCP, failed test~~ openshift-sdn failed health check with "link not found" on GCP, failed test Feb 2, 2018

openshift-merge-robot closed this as completed in #18355 Feb 2, 2018

zgalor mentioned this issue Feb 12, 2018

[Feature:Prometheus][Conformance] Prometheus when installed to the cluster should start and expose a secured proxy and unsecured metrics [Suite:openshift/conformance/parallel] 1m51s #17901

Closed
