
[builds][Slow] openshift pipeline build Pipeline with maven slave [It] should build and complete successfully #13984

Closed
coreydaley opened this issue May 1, 2017 · 24 comments
Labels
kind/test-flake (Categorizes issue or PR as related to test flakes.), priority/P1

@coreydaley (Member) commented May 1, 2017

[Fail] [builds][Slow] openshift pipeline build Pipeline with maven slave [It] should build and complete successfully
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/builds/pipeline.go:149

[INFO] Building war: /tmp/workspace/extended-test-jenkins-pipeline-0hvp5-dpbwt-openshift-jee-sample/target/ROOT.war
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 5.913 s
[INFO] Finished at: 2017-05-01T19:01:55+00:00
[INFO] Final Memory: 14M/202M
[INFO] ------------------------------------------------------------------------
[Pipeline] stash
Stashed 1 file(s)
[Pipeline] }
[Pipeline] // stage
[Pipeline] }
[Pipeline] // node
[Pipeline] node
Running on master in /var/lib/jenkins/jobs/extended-test-jenkins-pipeline-0hvp5-dpbwt-openshift-jee-sample/workspace
[Pipeline] {
[Pipeline] stage
[Pipeline] { (Build Image)
[Pipeline] unstash
[Pipeline] sh
[workspace] Running shell script
+ oc start-build openshift-jee-sample-docker --from-file=target/ROOT.war --follow -n extended-test-jenkins-pipeline-0hvp5-dpbwt
Uploading file "target/ROOT.war" as binary input for the build ...
Unable to connect to the server: net/http: TLS handshake timeout

https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin_extended/285/consoleFull#-86936742456bf4006e4b05b79524e5923

I was unable to reproduce locally.

@coreydaley coreydaley added kind/test-flake Categorizes issue or PR as related to test flakes. priority/P1 labels May 1, 2017
@stevekuznetsov (Contributor)

Dupe of #13983

@stevekuznetsov (Contributor)

/cc @gabemontero

@stevekuznetsov stevekuznetsov reopened this May 2, 2017
@gabemontero gabemontero self-assigned this May 2, 2017
@gabemontero (Contributor) commented May 2, 2017

OK, grepping for the build config openshift-jee-sample-docker in the master log, there are only GETs on the bc, which I think can all be attributed to config_controller.go, plus the delete of the resource when the test ends and the namespace is cleaned up:

openshift.log:124420:I0501 14:59:53.300932    4209 config_controller.go:47] Handling BuildConfig extended-test-jenkins-pipeline-0hvp5-dpbwt/openshift-jee-sample-docker
openshift.log:124421:I0501 14:59:53.300944    4209 config_controller.go:49] Handling Build Pruning: extended-test-jenkins-pipeline-0hvp5-dpbwt/openshift-jee-sample-docker
openshift.log:124422:I0501 14:59:53.301549    4209 audit.go:133] 2017-05-01T14:59:53.30151033-04:00 AUDIT: id="d54ff94f-a541-497b-805d-23bbb6954c76" ip="172.18.0.216" method="GET" user="system:openshift-master" groups="\"system:masters\",\"system:authenticated\"" as="<self>" asgroups="<lookup>" namespace="extended-test-jenkins-pipeline-0hvp5-dpbwt" uri="/oapi/v1/namespaces/extended-test-jenkins-pipeline-0hvp5-dpbwt/buildconfigs/openshift-jee-sample-docker"
openshift.log:124424:I0501 14:59:53.302620    4209 wrap.go:75] GET /oapi/v1/namespaces/extended-test-jenkins-pipeline-0hvp5-dpbwt/buildconfigs/openshift-jee-sample-docker: (1.36226ms) 200 [[openshift/v3.6.0 (linux/amd64) openshift/b48cd3c] 172.18.0.216:34570]
openshift.log:125650:I0501 15:00:04.706976    4209 config_controller.go:47] Handling BuildConfig extended-test-jenkins-pipeline-0hvp5-dpbwt/openshift-jee-sample-docker
openshift.log:125651:I0501 15:00:04.706985    4209 config_controller.go:49] Handling Build Pruning: extended-test-jenkins-pipeline-0hvp5-dpbwt/openshift-jee-sample-docker
openshift.log:125652:I0501 15:00:04.707410    4209 audit.go:133] 2017-05-01T15:00:04.707385112-04:00 AUDIT: id="736923c1-92ce-4ed9-9ff9-99e39c6cf5bc" ip="172.18.0.216" method="GET" user="system:openshift-master" groups="\"system:masters\",\"system:authenticated\"" as="<self>" asgroups="<lookup>" namespace="extended-test-jenkins-pipeline-0hvp5-dpbwt" uri="/oapi/v1/namespaces/extended-test-jenkins-pipeline-0hvp5-dpbwt/buildconfigs/openshift-jee-sample-docker"
openshift.log:125654:I0501 15:00:04.708287    4209 wrap.go:75] GET /oapi/v1/namespaces/extended-test-jenkins-pipeline-0hvp5-dpbwt/buildconfigs/openshift-jee-sample-docker: (1.08399ms) 200 [[openshift/v3.6.0 (linux/amd64) openshift/b48cd3c] 172.18.0.216:34570]
openshift.log:130358:I0501 15:02:04.707679    4209 config_controller.go:47] Handling BuildConfig extended-test-jenkins-pipeline-0hvp5-dpbwt/openshift-jee-sample-docker
openshift.log:130359:I0501 15:02:04.707688    4209 config_controller.go:49] Handling Build Pruning: extended-test-jenkins-pipeline-0hvp5-dpbwt/openshift-jee-sample-docker
openshift.log:130360:I0501 15:02:04.708148    4209 audit.go:133] 2017-05-01T15:02:04.708097065-04:00 AUDIT: id="2470851f-0434-4a0d-b65c-190aaae61934" ip="172.18.0.216" method="GET" user="system:openshift-master" groups="\"system:masters\",\"system:authenticated\"" as="<self>" asgroups="<lookup>" namespace="extended-test-jenkins-pipeline-0hvp5-dpbwt" uri="/oapi/v1/namespaces/extended-test-jenkins-pipeline-0hvp5-dpbwt/buildconfigs/openshift-jee-sample-docker"
openshift.log:130362:I0501 15:02:04.709009    4209 wrap.go:75] GET /oapi/v1/namespaces/extended-test-jenkins-pipeline-0hvp5-dpbwt/buildconfigs/openshift-jee-sample-docker: (1.094511ms) 200 [[openshift/v3.6.0 (linux/amd64) openshift/b48cd3c] 172.18.0.216:34570]
openshift.log:135066:I0501 15:04:04.707574    4209 config_controller.go:47] Handling BuildConfig extended-test-jenkins-pipeline-0hvp5-dpbwt/openshift-jee-sample-docker
openshift.log:135067:I0501 15:04:04.707586    4209 config_controller.go:49] Handling Build Pruning: extended-test-jenkins-pipeline-0hvp5-dpbwt/openshift-jee-sample-docker
openshift.log:135068:I0501 15:04:04.708042    4209 audit.go:133] 2017-05-01T15:04:04.7080175-04:00 AUDIT: id="fef275b0-9866-4748-bb9d-f999a4bb382d" ip="172.18.0.216" method="GET" user="system:openshift-master" groups="\"system:masters\",\"system:authenticated\"" as="<self>" asgroups="<lookup>" namespace="extended-test-jenkins-pipeline-0hvp5-dpbwt" uri="/oapi/v1/namespaces/extended-test-jenkins-pipeline-0hvp5-dpbwt/buildconfigs/openshift-jee-sample-docker"
openshift.log:135070:I0501 15:04:04.708870    4209 wrap.go:75] GET /oapi/v1/namespaces/extended-test-jenkins-pipeline-0hvp5-dpbwt/buildconfigs/openshift-jee-sample-docker: (1.030516ms) 200 [[openshift/v3.6.0 (linux/amd64) openshift/b48cd3c] 172.18.0.216:34570]
openshift.log:138895:I0501 15:06:04.707571    4209 config_controller.go:47] Handling BuildConfig extended-test-jenkins-pipeline-0hvp5-dpbwt/openshift-jee-sample-docker
openshift.log:138896:I0501 15:06:04.707579    4209 config_controller.go:49] Handling Build Pruning: extended-test-jenkins-pipeline-0hvp5-dpbwt/openshift-jee-sample-docker
openshift.log:138897:I0501 15:06:04.707985    4209 audit.go:133] 2017-05-01T15:06:04.70796178-04:00 AUDIT: id="bb43c1f0-f38b-4c34-8552-982a6d34601b" ip="172.18.0.216" method="GET" user="system:openshift-master" groups="\"system:masters\",\"system:authenticated\"" as="<self>" asgroups="<lookup>" namespace="extended-test-jenkins-pipeline-0hvp5-dpbwt" uri="/oapi/v1/namespaces/extended-test-jenkins-pipeline-0hvp5-dpbwt/buildconfigs/openshift-jee-sample-docker"
openshift.log:138899:I0501 15:06:04.708894    4209 wrap.go:75] GET /oapi/v1/namespaces/extended-test-jenkins-pipeline-0hvp5-dpbwt/buildconfigs/openshift-jee-sample-docker: (1.098852ms) 200 [[openshift/v3.6.0 (linux/amd64) openshift/b48cd3c] 172.18.0.216:34570]
openshift.log:140900:I0501 15:07:12.720613    4209 audit.go:133] 2017-05-01T15:07:12.720588278-04:00 AUDIT: id="14a8dcd1-8eb2-410a-b188-f82f849f6beb" ip="172.18.0.216" method="DELETE" user="system:openshift-master" groups="\"system:masters\",\"system:authenticated\"" as="<self>" asgroups="<lookup>" namespace="extended-test-jenkins-pipeline-0hvp5-dpbwt" uri="/oapi/v1/namespaces/extended-test-jenkins-pipeline-0hvp5-dpbwt/buildconfigs/openshift-jee-sample-docker"
openshift.log:140903:I0501 15:07:12.722806    4209 wrap.go:75] DELETE /oapi/v1/namespaces/extended-test-jenkins-pipeline-0hvp5-dpbwt/buildconfigs/openshift-jee-sample-docker: (2.37841ms) 200 [[openshift/v3.6.0 (linux/amd64) openshift/b48cd3c] 172.18.0.216:34570]

In other words, there are no POSTs, which would have occurred if the build config instantiatebinary endpoint had bubbled up to the master's audit.go or wrap.go level.

So things to pursue:

  1. is there a lower-level networking trace possible, on either the client or the master side, within the context of our extended tests running on Jenkins?
  2. traversing the k8s REST client code, I see a very sparse sprinkling of glog.V(4) trace; we could try adding --loglevel=10 to the oc start-build in the maven test and see if that reveals anything

I'll start a PR for 2), and start asking around about 1)

@oatmealraisin

Another failure in the extended tests today.

An excerpt from the build logs shows some networking issues:

Cloning repository https://github.com/openshift/jenkins-client-plugin.git
 > git init /var/lib/jenkins/jobs/extended-test-jenkins-pipeline-3spsj-q825m-sample-pipeline-openshift-client-plugin/workspace@script # timeout=10
Fetching upstream changes from https://github.com/openshift/jenkins-client-plugin.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/openshift/jenkins-client-plugin.git +refs/heads/*:refs/remotes/origin/*
 > git config remote.origin.url https://github.com/openshift/jenkins-client-plugin.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/openshift/jenkins-client-plugin.git # timeout=10
Fetching upstream changes from https://github.com/openshift/jenkins-client-plugin.git
 > git fetch --tags --progress https://github.com/openshift/jenkins-client-plugin.git +refs/heads/*:refs/remotes/origin/*
 > git rev-parse origin/master^{commit} # timeout=10
Checking out Revision ba6a931f77107189a796dc966425b08f8e710d5a (origin/master)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f ba6a931f77107189a796dc966425b08f8e710d5a
First time build. Skipping changelog.
[Pipeline] echo

[Pipeline] node
Running on master in /var/lib/jenkins/jobs/extended-test-jenkins-pipeline-3spsj-q825m-sample-pipeline-openshift-client-plugin/workspace
[Pipeline] {
[Pipeline] _OcContextInit
[Pipeline] readFile
[Pipeline] readFile
[Pipeline] _OcAction
[describe] 
[Pipeline] }
[Pipeline] // node
[Pipeline] End of Pipeline
ERROR: Error during describe;
{reference={}, err=Unable to connect to the server: net/http: TLS handshake timeout, verb=describe, cmd=oc describe serviceaccount --server=https://172.30.0.1:443 --namespace=extended-test-jenkins-pipeline-3spsj-q825m --token=XXXXX , out=, status=1}

Finished: FAILURE

The Jenkins log also has repeated connection exceptions thrown:

May 03, 2017 1:37:30 PM io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager nextReconnectInterval
INFO: Current reconnect backoff is 1000 milliseconds (T0)
May 03, 2017 1:37:30 PM io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager closeWebSocket
WARNING: Failed to write websocket close message: class java.net.SocketException Socket is closed
May 03, 2017 1:37:30 PM io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$2 onFailure
WARNING: Exec Failure
java.net.SocketException: Connection closed by remote host
	at sun.security.ssl.SSLSocketImpl.checkWrite(SSLSocketImpl.java:1565)
	at sun.security.ssl.AppOutputStream.write(AppOutputStream.java:71)
	at okio.Okio$1.write(Okio.java:78)
	at okio.AsyncTimeout$1.write(AsyncTimeout.java:180)
	at okio.RealBufferedSink.emit(RealBufferedSink.java:178)
	at okhttp3.internal.ws.WebSocketWriter.writeControlFrameSynchronized(WebSocketWriter.java:152)
	at okhttp3.internal.ws.WebSocketWriter.writePing(WebSocketWriter.java:79)
	at okhttp3.internal.ws.RealWebSocket.sendPing(RealWebSocket.java:143)
	at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$2$1.execute(WatchConnectionManager.java:171)
	at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$NamedRunnable.run(WatchConnectionManager.java:424)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)

@gabemontero (Contributor)

In chatting with @oatmealraisin and digesting the data he noted, of particular interest to me was:

ERROR: Error during describe;
{reference={}, err=Unable to connect to the server: net/http: TLS handshake timeout, verb=describe, cmd=oc describe serviceaccount --server=https://172.30.0.1:443 --namespace=extended-test-jenkins-pipeline-3spsj-q825m --token=XXXXX , out=, status=1}

This is exactly like the error seen previously with oc start-build, only with oc describe this time.

I agree the java side exceptions @oatmealraisin noted seem similar.

We'll have to consider when diving into this more whether this issue is somehow Jenkins/JVM specific, or if the Jenkins/JVM scenario is surfacing a new instability in openshift/origin.

@gabemontero (Contributor)

Possible correlation ... in debugging some new unit tests, if an api object was invalid, rather than getting a validation error, I got a tls handshake error:

--- FAIL: TestSetPipelineBuild (0.00s)
	logs_test.go:111: &{{ } {      0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] } {{ <nil> <nil> <nil> {map[] map[]} map[] map[] <nil>} 0 [] 0 <nil> false false map[] <nil>} {0 0 0 0 0 0 <nil> [] 0}}, unexpected results isPipeline false JenkinsLogs true
2017-05-03 13:02:29.801786 I | http: TLS handshake error from 127.0.0.1:48424: EOF
2017-05-03 13:02:29.819717 I | http: TLS handshake error from 127.0.0.1:49948: EOF
Tag yourproject/rails:tip set to openshift/ruby:latest.
Tag yourproject/rails:tip set to openshift/ruby:2.0.
Deleted tag yourproject/rails:tip.
Tag yourproject/rails:tip set to openshift/ruby:2.0.
Tag yourproject/rails:tip set to openshift/ruby:2.0.
Deleted tag yourproject/rails:tip.
FAIL

I wonder if, at least for the failed oc calls, some other error is getting masked as a TLS handshake failure ....

@gabemontero (Contributor) commented May 4, 2017

I am seeing another flavor of this test failing. Specifically, it looks like oc start-build --follow is not exiting because of the --follow. I've seen this when running oc start-build openshift-jee-sample-docker --follow manually: the build finishes, but the oc ... --follow does not exit and I have to Ctrl-C.

I've also reproduced this locally/manually with Jenkins deployed in OpenShift. It does not happen every time, but it has happened more than once.

Also, on the point of "what might have changed" ... we did update the oc client in the jenkins image late last week: see openshift/jenkins@d48c23c

I wonder if that is at least part of the instability we've seen, at least in the cases where invocations of oc are involved (which includes our extended tests and the openshift-client plugin). It would exclude sync plugin / fabric8 ssl comm issues, and the master seeing readiness check failures for the jenkins pod.

@bparees wdyt ^^

UPDATE: switching --follow to --wait in the example maven pipeline template has the job passing consistently now in my local env (vs. the --follow version hanging at least half the time).

@gabemontero (Contributor)

Also note, oc logs really defers to k8s logs for the actual streaming, etc. It potentially could have been impacted by the rebase.

@stevekuznetsov (Contributor)

@ncdc could probably weigh in on kubectl logs and/or spdystream

@ncdc (Contributor) commented May 5, 2017

We don't use spdy for retrieving logs. Only exec, attach, and portforward.

There is a kube PR to fix an issue where --follow wouldn't stop when the container exited, but as best I can tell, it only applies to the json-file docker log driver. kubernetes/kubernetes#44406.

If you start a pod that does something like print out the date every second for five or ten seconds before terminating, and then in another terminal run oc logs mypod -f, does it stop following, or does it hang waiting for more logs?
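For anyone wanting to try that, a minimal pod along those lines might look like the following (the name, image, and command here are illustrative assumptions, not the pod.json attached later in this thread):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: date-printer
spec:
  restartPolicy: Never
  containers:
  - name: date-printer
    image: centos:centos7
    # print the date once a second for ten seconds, then exit
    command: ["/bin/sh", "-c", "for i in 1 2 3 4 5 6 7 8 9 10; do date; sleep 1; done"]
```

After `oc create -f pod.yaml`, running `oc logs -f date-printer` in another terminal should exit shortly after the loop finishes; hanging there instead would reproduce the symptom.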

@gabemontero (Contributor) commented May 5, 2017 via email

@ncdc (Contributor) commented May 5, 2017 via email

@gabemontero (Contributor)

Separate from the --follow item discussed here, on the debug front for the TLS issues I'm planning to:

  • ramp up the extended test scope ... the focused scope has been too successful
  • on the jenkins image commit I noted earlier, in addition to a bump in the oc version there was a bump in the jenkins version; assuming I can get at least intermittent failures, I am then going to run with custom jenkins images that revert first jenkins, then oc, and see if that has any bearing

@gabemontero (Contributor) commented May 5, 2017 via email

@gabemontero (Contributor)

@ncdc - here is what I believe is a simple pod example:
pod.json.zip

Details on my locally built origin:

gmontero ~/go/src/github.com/openshift/origin  (maven-ext-test-dbg)$ openshift version
openshift v3.6.0-alpha.1+999c737-400
kubernetes v1.6.1+5115d708d7
etcd 3.1.0
gmontero ~/go/src/github.com/openshift/origin  (maven-ext-test-dbg)$ oc version
oc v3.6.0-alpha.1+999c737-400
kubernetes v1.6.1+5115d708d7
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://127.0.0.1:8443
openshift v3.6.0-alpha.1+999c737-400
kubernetes v1.6.1+5115d708d7
gmontero ~/go/src/github.com/openshift/origin  (maven-ext-test-dbg)$ 

This is at some level I built yesterday (with a debug commit on top).

This particular pod ends up erroring out after running for a bit, but since I've seen this occur repeatedly, both with pods that complete after a while and with pods that error out after a while, I figured it wasn't worth debugging this sample further.

I'm going to rebase/rebuild my origin in a sec from latest master, to see if the error still occurs. I'll report the findings when I have them.

thanks

@gabemontero (Contributor)

@ncdc - hey, still happens for me after rebasing this morning.

version particulars:

gmontero ~/go/src/github.com/openshift/origin  (master)$ git log -3
commit b9dd637aaee53f47e9eded0d9b2e87d076766d93
Merge: 8f1797d 4c44fe4
Author: OpenShift Bot <[email protected]>
Date:   Fri May 5 09:51:13 2017 -0500

    Merge pull request #13932 from enj/enj/i/gssapi_dc_wait/13927
    
    Merged by openshift-bot

commit 8f1797d73a754e32eb9059aa428e823bd3472bf9
Merge: 076d231 1e93649
Author: OpenShift Bot <[email protected]>
Date:   Fri May 5 08:18:46 2017 -0500

    Merge pull request #14040 from bparees/defaulter
    
    Merged by openshift-bot

commit 076d2314b0296b6c0d486010ef6167cae6cec1b5
Merge: 009cbef 58868b5
Author: OpenShift Bot <[email protected]>
Date:   Fri May 5 04:37:21 2017 -0500

    Merge pull request #14034 from ncdc/do-not-show-policy-rules-with-attribute-restrictions
    
    Merged by openshift-bot
gmontero ~/go/src/github.com/openshift/origin  (master)$ openshift version
openshift v3.6.0-alpha.1+b9dd637-430
kubernetes v1.6.1+5115d708d7
etcd 3.1.0
gmontero ~/go/src/github.com/openshift/origin  (master)$ oc version
oc v3.6.0-alpha.1+b9dd637-430
kubernetes v1.6.1+5115d708d7
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://127.0.0.1:8443
openshift v3.6.0-alpha.1+b9dd637-430
kubernetes v1.6.1+5115d708d7

And it is probably obvious, but to be clear, I ran:

  1. oc create -f pods.json
  2. oc logs -f pods/openshift-jee-sample-docker-1-build

thanks

@ncdc (Contributor) commented May 5, 2017

@gabemontero I haven't tested with a build pod, but I have tested with a simple centos:centos7, and the oc logs -f command terminates as soon as the container exits.

@gabemontero (Contributor) commented May 5, 2017 via email

@gabemontero (Contributor) commented May 5, 2017 via email

@ncdc (Contributor) commented May 5, 2017 via email

@gabemontero (Contributor) commented May 5, 2017

@ncdc here is the thread dump:
thread-dump.txt

Other than goroutine 72, everything else is blocked on IO or semaphores down in golang code.

The comment before the channel select in the apimachinery code was ominous at first glance:

		// NOTE: b/c there is no priority selection in golang
		// it is possible for this to race, meaning we could
		// trigger t.C and stopCh, and t.C select falls through.
		// In order to mitigate we re-check stopCh at the beginning
		// of every loop to prevent extra executions of f().

Though that logic has not changed in a while if I'm reading git blame correctly.

Here is my go version: go version go1.7.3 linux/amd64

first blush, smells like a golang issue

(admittedly novice) opinion: I would contend some defensive programming is in order, and that the call to io.Copy in k8s.io/kubernetes/pkg/kubectl/cmd.LogsOptions.getLogs should be coupled with a select/channel/timer so we don't block forever ... thoughts?

UPDATE (May 11): upgraded to go1.7.5; no effect on my local runs of oc logs -f.

@gabemontero (Contributor)

Older PR test and overnight failures have been pruned, but between the debug runs in #14013, a recent overnight run that resulted in #14093, and @coreydaley's failure noted in #14084, I'm now seeing the common thread of volume mount issues. In the case of the client plugin, they result in the mongodb deployment failing. Here is a snippet from @coreydaley's run cataloged in #14084:

May  5 15:48:46.203: INFO: At 2017-05-05 15:42:23.452467294 -0400 EDT - event for jenkins-1-deploy: {kubelet 172.18.0.186} FailedMount: Unable to mount volumes for pod "jenkins-1-deploy_extended-test-jenkins-pipeline-n2w9s-r8pf3(99bb5368-31ca-11e7-ab80-0e76db0f9e3c)": timeout expired waiting for volumes to attach/mount for pod "extended-test-jenkins-pipeline-n2w9s-r8pf3"/"jenkins-1-deploy". list of unattached/unmounted volumes=[deployer-token-0qmcm]
May  5 15:48:46.203: INFO: At 2017-05-05 15:42:23.452535031 -0400 EDT - event for jenkins-1-deploy: {kubelet 172.18.0.186} FailedSync: Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "extended-test-jenkins-pipeline-n2w9s-r8pf3"/"jenkins-1-deploy". list of unattached/unmounted volumes=[deployer-token-0qmcm]
May  5 15:48:46.203: INFO: At 2017-05-05 15:42:30.954662138 -0400 EDT - event for mongodb-1-77ttj: {kubelet 172.18.0.186} Started: Started container with id 428c475c64b8b41904775b28744ca7793090960feb08376ccfd54dd5f8a60677
May  5 15:48:46.203: INFO: At 2017-05-05 15:42:53.009580387 -0400 EDT - event for mongodb-1-deploy: {kubelet 172.18.0.186} Killing: Killing container with id docker://95b68fb656cf1c0c17e542ab73123629e84885caae3cc6133bcc673864b6b479:Need to kill Pod
May  5 15:48:46.203: INFO: At 2017-05-05 15:44:52.985650286 -0400 EDT - event for mongodb-1-deploy: {kubelet 172.18.0.186} FailedMount: Unable to mount volumes for pod "mongodb-1-deploy_extended-test-jenkins-pipeline-n2w9s-r8pf3(c7fe9baa-31ca-11e7-ab80-0e76db0f9e3c)": timeout expired waiting for volumes to attach/mount for pod "extended-test-jenkins-pipeline-n2w9s-r8pf3"/"mongodb-1-deploy". list of unattached/unmounted volumes=[deployer-token-0qmcm]
May  5 15:48:46.203: INFO: At 2017-05-05 15:44:52.985704605 -0400 EDT - event for mongodb-1-deploy: {kubelet 172.18.0.186} FailedSync: Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "extended-test-jenkins-pipeline-n2w9s-r8pf3"/"mongodb-1-deploy". list of unattached/unmounted volumes=[deployer-token-0qmcm]

I'm not ready to close this issue out in favor of #14093 until:

a) we get a good week or so of runs where a failure occurs and these apparent mount issues are not in play ... though admittedly, it is entirely conceivable that these were present in the older runs and I just missed them / wasn't cognizant enough earlier on to look for them

b) the phenomenon of oc logs / oc start-build --follow not returning is still, in theory, pending ... minimally, as a fallback, I'm thinking we switch the oc start-build --follow to oc start-build --wait in our jee jenkins pipeline sample

@gabemontero (Contributor)

FYI - broke out the log following thread here to separate issue #14148

@gabemontero (Contributor)

The latest findings in #14013 are pointing toward a solution for this issue.

If the trend in the PR continues for another day or so, I'll rebrand it as a fix.
