
[builds][Slow] openshift pipeline build Pipeline with maven slave [It] should build and complete successfully #13984

Closed
coreydaley opened this issue May 1, 2017 · 24 comments
Labels
kind/test-flake (Categorizes issue or PR as related to test flakes.), priority/P1

@coreydaley (Member) commented May 1, 2017

[Fail] [builds][Slow] openshift pipeline build Pipeline with maven slave [It] should build and complete successfully
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/builds/pipeline.go:149

[INFO] Building war: /tmp/workspace/extended-test-jenkins-pipeline-0hvp5-dpbwt-openshift-jee-sample/target/ROOT.war
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 5.913 s
[INFO] Finished at: 2017-05-01T19:01:55+00:00
[INFO] Final Memory: 14M/202M
[INFO] ------------------------------------------------------------------------
[Pipeline] stash
Stashed 1 file(s)
[Pipeline] }
[Pipeline] // stage
[Pipeline] }
[Pipeline] // node
[Pipeline] node
Running on master in /var/lib/jenkins/jobs/extended-test-jenkins-pipeline-0hvp5-dpbwt-openshift-jee-sample/workspace
[Pipeline] {
[Pipeline] stage
[Pipeline] { (Build Image)
[Pipeline] unstash
[Pipeline] sh
[workspace] Running shell script
+ oc start-build openshift-jee-sample-docker --from-file=target/ROOT.war --follow -n extended-test-jenkins-pipeline-0hvp5-dpbwt
Uploading file "target/ROOT.war" as binary input for the build ...
Unable to connect to the server: net/http: TLS handshake timeout

https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin_extended/285/consoleFull#-86936742456bf4006e4b05b79524e5923

I was unable to reproduce locally.

@coreydaley coreydaley added kind/test-flake Categorizes issue or PR as related to test flakes. priority/P1 labels May 1, 2017
@stevekuznetsov (Contributor)

Dupe of #13983

@stevekuznetsov (Contributor)

/cc @gabemontero

@stevekuznetsov stevekuznetsov reopened this May 2, 2017
@gabemontero gabemontero self-assigned this May 2, 2017
@gabemontero (Contributor) commented May 2, 2017

OK, grepping for the build config openshift-jee-sample-docker in the master log, there are only GETs on the bc, which I think can all be attributed to config_controller.go, plus the delete of the resource when the test ends and the namespace is cleaned up:

openshift.log:124420:I0501 14:59:53.300932    4209 config_controller.go:47] Handling BuildConfig extended-test-jenkins-pipeline-0hvp5-dpbwt/openshift-jee-sample-docker
openshift.log:124421:I0501 14:59:53.300944    4209 config_controller.go:49] Handling Build Pruning: extended-test-jenkins-pipeline-0hvp5-dpbwt/openshift-jee-sample-docker
openshift.log:124422:I0501 14:59:53.301549    4209 audit.go:133] 2017-05-01T14:59:53.30151033-04:00 AUDIT: id="d54ff94f-a541-497b-805d-23bbb6954c76" ip="172.18.0.216" method="GET" user="system:openshift-master" groups="\"system:masters\",\"system:authenticated\"" as="<self>" asgroups="<lookup>" namespace="extended-test-jenkins-pipeline-0hvp5-dpbwt" uri="/oapi/v1/namespaces/extended-test-jenkins-pipeline-0hvp5-dpbwt/buildconfigs/openshift-jee-sample-docker"
openshift.log:124424:I0501 14:59:53.302620    4209 wrap.go:75] GET /oapi/v1/namespaces/extended-test-jenkins-pipeline-0hvp5-dpbwt/buildconfigs/openshift-jee-sample-docker: (1.36226ms) 200 [[openshift/v3.6.0 (linux/amd64) openshift/b48cd3c] 172.18.0.216:34570]
openshift.log:125650:I0501 15:00:04.706976    4209 config_controller.go:47] Handling BuildConfig extended-test-jenkins-pipeline-0hvp5-dpbwt/openshift-jee-sample-docker
openshift.log:125651:I0501 15:00:04.706985    4209 config_controller.go:49] Handling Build Pruning: extended-test-jenkins-pipeline-0hvp5-dpbwt/openshift-jee-sample-docker
openshift.log:125652:I0501 15:00:04.707410    4209 audit.go:133] 2017-05-01T15:00:04.707385112-04:00 AUDIT: id="736923c1-92ce-4ed9-9ff9-99e39c6cf5bc" ip="172.18.0.216" method="GET" user="system:openshift-master" groups="\"system:masters\",\"system:authenticated\"" as="<self>" asgroups="<lookup>" namespace="extended-test-jenkins-pipeline-0hvp5-dpbwt" uri="/oapi/v1/namespaces/extended-test-jenkins-pipeline-0hvp5-dpbwt/buildconfigs/openshift-jee-sample-docker"
openshift.log:125654:I0501 15:00:04.708287    4209 wrap.go:75] GET /oapi/v1/namespaces/extended-test-jenkins-pipeline-0hvp5-dpbwt/buildconfigs/openshift-jee-sample-docker: (1.08399ms) 200 [[openshift/v3.6.0 (linux/amd64) openshift/b48cd3c] 172.18.0.216:34570]
openshift.log:130358:I0501 15:02:04.707679    4209 config_controller.go:47] Handling BuildConfig extended-test-jenkins-pipeline-0hvp5-dpbwt/openshift-jee-sample-docker
openshift.log:130359:I0501 15:02:04.707688    4209 config_controller.go:49] Handling Build Pruning: extended-test-jenkins-pipeline-0hvp5-dpbwt/openshift-jee-sample-docker
openshift.log:130360:I0501 15:02:04.708148    4209 audit.go:133] 2017-05-01T15:02:04.708097065-04:00 AUDIT: id="2470851f-0434-4a0d-b65c-190aaae61934" ip="172.18.0.216" method="GET" user="system:openshift-master" groups="\"system:masters\",\"system:authenticated\"" as="<self>" asgroups="<lookup>" namespace="extended-test-jenkins-pipeline-0hvp5-dpbwt" uri="/oapi/v1/namespaces/extended-test-jenkins-pipeline-0hvp5-dpbwt/buildconfigs/openshift-jee-sample-docker"
openshift.log:130362:I0501 15:02:04.709009    4209 wrap.go:75] GET /oapi/v1/namespaces/extended-test-jenkins-pipeline-0hvp5-dpbwt/buildconfigs/openshift-jee-sample-docker: (1.094511ms) 200 [[openshift/v3.6.0 (linux/amd64) openshift/b48cd3c] 172.18.0.216:34570]
openshift.log:135066:I0501 15:04:04.707574    4209 config_controller.go:47] Handling BuildConfig extended-test-jenkins-pipeline-0hvp5-dpbwt/openshift-jee-sample-docker
openshift.log:135067:I0501 15:04:04.707586    4209 config_controller.go:49] Handling Build Pruning: extended-test-jenkins-pipeline-0hvp5-dpbwt/openshift-jee-sample-docker
openshift.log:135068:I0501 15:04:04.708042    4209 audit.go:133] 2017-05-01T15:04:04.7080175-04:00 AUDIT: id="fef275b0-9866-4748-bb9d-f999a4bb382d" ip="172.18.0.216" method="GET" user="system:openshift-master" groups="\"system:masters\",\"system:authenticated\"" as="<self>" asgroups="<lookup>" namespace="extended-test-jenkins-pipeline-0hvp5-dpbwt" uri="/oapi/v1/namespaces/extended-test-jenkins-pipeline-0hvp5-dpbwt/buildconfigs/openshift-jee-sample-docker"
openshift.log:135070:I0501 15:04:04.708870    4209 wrap.go:75] GET /oapi/v1/namespaces/extended-test-jenkins-pipeline-0hvp5-dpbwt/buildconfigs/openshift-jee-sample-docker: (1.030516ms) 200 [[openshift/v3.6.0 (linux/amd64) openshift/b48cd3c] 172.18.0.216:34570]
openshift.log:138895:I0501 15:06:04.707571    4209 config_controller.go:47] Handling BuildConfig extended-test-jenkins-pipeline-0hvp5-dpbwt/openshift-jee-sample-docker
openshift.log:138896:I0501 15:06:04.707579    4209 config_controller.go:49] Handling Build Pruning: extended-test-jenkins-pipeline-0hvp5-dpbwt/openshift-jee-sample-docker
openshift.log:138897:I0501 15:06:04.707985    4209 audit.go:133] 2017-05-01T15:06:04.70796178-04:00 AUDIT: id="bb43c1f0-f38b-4c34-8552-982a6d34601b" ip="172.18.0.216" method="GET" user="system:openshift-master" groups="\"system:masters\",\"system:authenticated\"" as="<self>" asgroups="<lookup>" namespace="extended-test-jenkins-pipeline-0hvp5-dpbwt" uri="/oapi/v1/namespaces/extended-test-jenkins-pipeline-0hvp5-dpbwt/buildconfigs/openshift-jee-sample-docker"
openshift.log:138899:I0501 15:06:04.708894    4209 wrap.go:75] GET /oapi/v1/namespaces/extended-test-jenkins-pipeline-0hvp5-dpbwt/buildconfigs/openshift-jee-sample-docker: (1.098852ms) 200 [[openshift/v3.6.0 (linux/amd64) openshift/b48cd3c] 172.18.0.216:34570]
openshift.log:140900:I0501 15:07:12.720613    4209 audit.go:133] 2017-05-01T15:07:12.720588278-04:00 AUDIT: id="14a8dcd1-8eb2-410a-b188-f82f849f6beb" ip="172.18.0.216" method="DELETE" user="system:openshift-master" groups="\"system:masters\",\"system:authenticated\"" as="<self>" asgroups="<lookup>" namespace="extended-test-jenkins-pipeline-0hvp5-dpbwt" uri="/oapi/v1/namespaces/extended-test-jenkins-pipeline-0hvp5-dpbwt/buildconfigs/openshift-jee-sample-docker"
openshift.log:140903:I0501 15:07:12.722806    4209 wrap.go:75] DELETE /oapi/v1/namespaces/extended-test-jenkins-pipeline-0hvp5-dpbwt/buildconfigs/openshift-jee-sample-docker: (2.37841ms) 200 [[openshift/v3.6.0 (linux/amd64) openshift/b48cd3c] 172.18.0.216:34570]

In other words, there are no POSTs, which would have occurred if the build config instantiatebinary endpoint had bubbled up to the master's audit.go or wrap.go level.

So things to pursue:

  1. is there a lower-level networking trace possible, on either the client or the master side, within the context of our extended tests running on Jenkins?
  2. traversing the k8s REST client code, I see a very sparse sprinkling of glog.V(4) trace; we could try adding --loglevel=10 to the oc start-build in the maven test and see if that reveals anything

I'll start a PR for 2), and start asking around about 1)

@oatmealraisin

Another failure in the extended tests today.

An excerpt from the build logs shows some networking issues:

Cloning repository https://github.com/openshift/jenkins-client-plugin.git
 > git init /var/lib/jenkins/jobs/extended-test-jenkins-pipeline-3spsj-q825m-sample-pipeline-openshift-client-plugin/workspace@script # timeout=10
Fetching upstream changes from https://github.com/openshift/jenkins-client-plugin.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/openshift/jenkins-client-plugin.git +refs/heads/*:refs/remotes/origin/*
 > git config remote.origin.url https://github.com/openshift/jenkins-client-plugin.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/openshift/jenkins-client-plugin.git # timeout=10
Fetching upstream changes from https://github.com/openshift/jenkins-client-plugin.git
 > git fetch --tags --progress https://github.com/openshift/jenkins-client-plugin.git +refs/heads/*:refs/remotes/origin/*
 > git rev-parse origin/master^{commit} # timeout=10
Checking out Revision ba6a931f77107189a796dc966425b08f8e710d5a (origin/master)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f ba6a931f77107189a796dc966425b08f8e710d5a
First time build. Skipping changelog.
[Pipeline] echo

[Pipeline] node
Running on master in /var/lib/jenkins/jobs/extended-test-jenkins-pipeline-3spsj-q825m-sample-pipeline-openshift-client-plugin/workspace
[Pipeline] {
[Pipeline] _OcContextInit
[Pipeline] readFile
[Pipeline] readFile
[Pipeline] _OcAction
[describe] 
[Pipeline] }
[Pipeline] // node
[Pipeline] End of Pipeline
ERROR: Error during describe;
{reference={}, err=Unable to connect to the server: net/http: TLS handshake timeout, verb=describe, cmd=oc describe serviceaccount --server=https://172.30.0.1:443 --namespace=extended-test-jenkins-pipeline-3spsj-q825m --token=XXXXX , out=, status=1}

Finished: FAILURE

The Jenkins log also has repeated connection exceptions thrown:

May 03, 2017 1:37:30 PM io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager nextReconnectInterval
INFO: Current reconnect backoff is 1000 milliseconds (T0)
May 03, 2017 1:37:30 PM io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager closeWebSocket
WARNING: Failed to write websocket close message: class java.net.SocketException Socket is closed
May 03, 2017 1:37:30 PM io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$2 onFailure
WARNING: Exec Failure
java.net.SocketException: Connection closed by remote host
	at sun.security.ssl.SSLSocketImpl.checkWrite(SSLSocketImpl.java:1565)
	at sun.security.ssl.AppOutputStream.write(AppOutputStream.java:71)
	at okio.Okio$1.write(Okio.java:78)
	at okio.AsyncTimeout$1.write(AsyncTimeout.java:180)
	at okio.RealBufferedSink.emit(RealBufferedSink.java:178)
	at okhttp3.internal.ws.WebSocketWriter.writeControlFrameSynchronized(WebSocketWriter.java:152)
	at okhttp3.internal.ws.WebSocketWriter.writePing(WebSocketWriter.java:79)
	at okhttp3.internal.ws.RealWebSocket.sendPing(RealWebSocket.java:143)
	at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$2$1.execute(WatchConnectionManager.java:171)
	at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$NamedRunnable.run(WatchConnectionManager.java:424)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)

@gabemontero (Contributor)

In chatting with @oatmealraisin and digesting the data he noted, of particular interest to me was:

ERROR: Error during describe;
{reference={}, err=Unable to connect to the server: net/http: TLS handshake timeout, verb=describe, cmd=oc describe serviceaccount --server=https://172.30.0.1:443 --namespace=extended-test-jenkins-pipeline-3spsj-q825m --token=XXXXX , out=, status=1}

This is exactly like the error seen previously with oc start-build, only with oc describe this time.

I agree the java side exceptions @oatmealraisin noted seem similar.

We'll have to consider when diving into this more whether this issue is somehow Jenkins/JVM specific, or if the Jenkins/JVM scenario is surfacing a new instability in openshift/origin.

@gabemontero (Contributor)

Possible correlation ... in debugging some new unit tests, if an api object was invalid, rather than getting a validation error, I got a tls handshake error:

--- FAIL: TestSetPipelineBuild (0.00s)
	logs_test.go:111: &{{ } {      0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] } {{ <nil> <nil> <nil> {map[] map[]} map[] map[] <nil>} 0 [] 0 <nil> false false map[] <nil>} {0 0 0 0 0 0 <nil> [] 0}}, unexpected results isPipeline false JenkinsLogs true
2017-05-03 13:02:29.801786 I | http: TLS handshake error from 127.0.0.1:48424: EOF
2017-05-03 13:02:29.819717 I | http: TLS handshake error from 127.0.0.1:49948: EOF
Tag yourproject/rails:tip set to openshift/ruby:latest.
Tag yourproject/rails:tip set to openshift/ruby:2.0.
Deleted tag yourproject/rails:tip.
Tag yourproject/rails:tip set to openshift/ruby:2.0.
Tag yourproject/rails:tip set to openshift/ruby:2.0.
Deleted tag yourproject/rails:tip.
FAIL

I wonder if, at least for the failed oc calls, some other error is getting masked as a TLS handshake failure ....

@gabemontero (Contributor) commented May 4, 2017

I am seeing another flavor of this test failing. Specifically, it looks like oc start-build --follow is not exiting because of the --follow. I've seen this when running oc start-build openshift-jee-sample-docker --follow manually: the build finishes, but the oc ... --follow does not exit and I have to Ctrl-C.

I've also reproduced this locally/manually with Jenkins deployed in OpenShift. It does not happen every time, but it has happened more than once.

Also, on the point of "what might have changed" ... we did update the oc client in the jenkins image late last week: see openshift/jenkins@d48c23c

I wonder if that is at least part of the instability we've seen, at least in the cases where invocations of oc are involved (which includes our extended tests and the openshift-client plugin). It would exclude sync plugin / fabric8 ssl comm issues, and the master seeing readiness check failures for the jenkins pod.

@bparees wdyt ^^

UPDATE: switching --follow to --wait in the example maven pipeline template has the job passing consistently now in my local env (vs. the --follow version hanging at least half the time).

@gabemontero (Contributor)

Also note, oc logs really defers to k8s logs for the actual streaming, etc. It potentially could have been impacted by the rebase.

@stevekuznetsov (Contributor)

@ncdc could probably weigh in on kubectl logs and/or spdystream

@ncdc (Contributor) commented May 5, 2017

We don't use spdy for retrieving logs. Only exec, attach, and portforward.

There is a kube PR to fix an issue where --follow wouldn't stop when the container exited, but as best I can tell, it only applies to the json-file docker log driver. kubernetes/kubernetes#44406.

If you start a pod that does something like print out the date every second for five or ten seconds before terminating, and then in another terminal run oc logs mypod -f, does it stop following, or does it hang waiting for more logs?
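For anyone wanting to try that, a minimal pod along those lines might look like the following (the name, image, and command here are illustrative assumptions, not the pod.json attached later in this thread):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: date-printer
spec:
  restartPolicy: Never
  containers:
  - name: date-printer
    image: centos:centos7
    # print the date once a second for ten seconds, then exit
    command: ["/bin/sh", "-c", "for i in 1 2 3 4 5 6 7 8 9 10; do date; sleep 1; done"]
```

After `oc create -f pod.yaml`, running `oc logs -f date-printer` in another terminal should exit shortly after the loop finishes; hanging there instead would reproduce the symptom.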

@gabemontero (Contributor) commented May 5, 2017 via email

@ncdc (Contributor) commented May 5, 2017 via email

@gabemontero (Contributor)

Separate from the --follow item discussed here, on the debug front for the TLS issues I'm planning to:

  • ramp up the extended test scope ... the focused scope has been too successful
  • on the jenkins image commit I noted earlier, in addition to a bump in the oc version there was a bump in the jenkins version; assuming I can get at least intermittent failures, I am then going to run with custom jenkins images that revert first jenkins, then oc, and see if that has any bearing

@gabemontero (Contributor) commented May 5, 2017 via email

@gabemontero (Contributor)

@ncdc - here is what I believe is a simple pod example:
pod.json.zip

Details on my locally built origin:

gmontero ~/go/src/github.com/openshift/origin  (maven-ext-test-dbg)$ openshift version
openshift v3.6.0-alpha.1+999c737-400
kubernetes v1.6.1+5115d708d7
etcd 3.1.0
gmontero ~/go/src/github.com/openshift/origin  (maven-ext-test-dbg)$ oc version
oc v3.6.0-alpha.1+999c737-400
kubernetes v1.6.1+5115d708d7
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://127.0.0.1:8443
openshift v3.6.0-alpha.1+999c737-400
kubernetes v1.6.1+5115d708d7
gmontero ~/go/src/github.com/openshift/origin  (maven-ext-test-dbg)$ 

This is at some level I built yesterday (with a debug commit on top).

This particular pod ends up erroring out after running for a bit, but since I've seen this occur repeatedly, both with pods that complete after a while and with pods that error out after a while, I figured it wasn't worth debugging this sample further.

I'm going to rebase/rebuild my origin in a sec from latest master, to see if the error still occurs. I'll report the findings when I have them.

thanks

@gabemontero (Contributor)

@ncdc - hey, still happens for me after rebasing this morning.

version particulars:

gmontero ~/go/src/github.com/openshift/origin  (master)$ git log -3
commit b9dd637aaee53f47e9eded0d9b2e87d076766d93
Merge: 8f1797d 4c44fe4
Author: OpenShift Bot <[email protected]>
Date:   Fri May 5 09:51:13 2017 -0500

    Merge pull request #13932 from enj/enj/i/gssapi_dc_wait/13927
    
    Merged by openshift-bot

commit 8f1797d73a754e32eb9059aa428e823bd3472bf9
Merge: 076d231 1e93649
Author: OpenShift Bot <[email protected]>
Date:   Fri May 5 08:18:46 2017 -0500

    Merge pull request #14040 from bparees/defaulter
    
    Merged by openshift-bot

commit 076d2314b0296b6c0d486010ef6167cae6cec1b5
Merge: 009cbef 58868b5
Author: OpenShift Bot <[email protected]>
Date:   Fri May 5 04:37:21 2017 -0500

    Merge pull request #14034 from ncdc/do-not-show-policy-rules-with-attribute-restrictions
    
    Merged by openshift-bot
gmontero ~/go/src/github.com/openshift/origin  (master)$ openshift version
openshift v3.6.0-alpha.1+b9dd637-430
kubernetes v1.6.1+5115d708d7
etcd 3.1.0
gmontero ~/go/src/github.com/openshift/origin  (master)$ oc version
oc v3.6.0-alpha.1+b9dd637-430
kubernetes v1.6.1+5115d708d7
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://127.0.0.1:8443
openshift v3.6.0-alpha.1+b9dd637-430
kubernetes v1.6.1+5115d708d7

And it is probably obvious, but to be clear, I ran:

  1. oc create -f pods.json
  2. oc logs -f pods/openshift-jee-sample-docker-1-build

thanks

@ncdc (Contributor) commented May 5, 2017

@gabemontero I haven't tested with a build pod, but I have tested with a simple centos:centos7, and the oc logs -f command terminates as soon as the container exits.

@gabemontero (Contributor) commented May 5, 2017 via email

@gabemontero (Contributor) commented May 5, 2017 via email

@ncdc (Contributor) commented May 5, 2017 via email

@gabemontero (Contributor) commented May 5, 2017

@ncdc here is the thread dump:
thread-dump.txt

Other than goroutine 72, everything else is blocked on IO or semaphores down in golang code.

The comment before the channel select in the apimachinery code was ominous at first glance:

		// NOTE: b/c there is no priority selection in golang
		// it is possible for this to race, meaning we could
		// trigger t.C and stopCh, and t.C select falls through.
		// In order to mitigate we re-check stopCh at the beginning
		// of every loop to prevent extra executions of f().

Though that logic has not changed in a while if I'm reading git blame correctly.

Here is my go version: go version go1.7.3 linux/amd64

first blush, smells like a golang issue

(admittedly novice) opinion: I would contend some defensive programming is in order, and that the call to io.Copy in k8s.io/kubernetes/pkg/kubectl/cmd.LogsOptions.getLogs should be coupled with a select/channel/timer so we don't block forever ... thoughts?

UPDATE (May 11): upgraded to go1.7.5; no effect on my local runs of oc logs -f.

@gabemontero (Contributor)

Older PR test and overnight failures have been pruned, but between the debug runs in #14013, a recent overnight run that resulted in #14093, and @coreydaley's failure noted in #14084, I'm now seeing the common thread of volume mount issues. In the case of the client plugin, they result in the mongodb deployment failing. Here is a snippet from @coreydaley's run cataloged in #14084:

May  5 15:48:46.203: INFO: At 2017-05-05 15:42:23.452467294 -0400 EDT - event for jenkins-1-deploy: {kubelet 172.18.0.186} FailedMount: Unable to mount volumes for pod "jenkins-1-deploy_extended-test-jenkins-pipeline-n2w9s-r8pf3(99bb5368-31ca-11e7-ab80-0e76db0f9e3c)": timeout expired waiting for volumes to attach/mount for pod "extended-test-jenkins-pipeline-n2w9s-r8pf3"/"jenkins-1-deploy". list of unattached/unmounted volumes=[deployer-token-0qmcm]
May  5 15:48:46.203: INFO: At 2017-05-05 15:42:23.452535031 -0400 EDT - event for jenkins-1-deploy: {kubelet 172.18.0.186} FailedSync: Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "extended-test-jenkins-pipeline-n2w9s-r8pf3"/"jenkins-1-deploy". list of unattached/unmounted volumes=[deployer-token-0qmcm]
May  5 15:48:46.203: INFO: At 2017-05-05 15:42:30.954662138 -0400 EDT - event for mongodb-1-77ttj: {kubelet 172.18.0.186} Started: Started container with id 428c475c64b8b41904775b28744ca7793090960feb08376ccfd54dd5f8a60677
May  5 15:48:46.203: INFO: At 2017-05-05 15:42:53.009580387 -0400 EDT - event for mongodb-1-deploy: {kubelet 172.18.0.186} Killing: Killing container with id docker://95b68fb656cf1c0c17e542ab73123629e84885caae3cc6133bcc673864b6b479:Need to kill Pod
May  5 15:48:46.203: INFO: At 2017-05-05 15:44:52.985650286 -0400 EDT - event for mongodb-1-deploy: {kubelet 172.18.0.186} FailedMount: Unable to mount volumes for pod "mongodb-1-deploy_extended-test-jenkins-pipeline-n2w9s-r8pf3(c7fe9baa-31ca-11e7-ab80-0e76db0f9e3c)": timeout expired waiting for volumes to attach/mount for pod "extended-test-jenkins-pipeline-n2w9s-r8pf3"/"mongodb-1-deploy". list of unattached/unmounted volumes=[deployer-token-0qmcm]
May  5 15:48:46.203: INFO: At 2017-05-05 15:44:52.985704605 -0400 EDT - event for mongodb-1-deploy: {kubelet 172.18.0.186} FailedSync: Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "extended-test-jenkins-pipeline-n2w9s-r8pf3"/"mongodb-1-deploy". list of unattached/unmounted volumes=[deployer-token-0qmcm]

I'm not ready to close this issue out in favor of #14093 until:

a) we get a good week or so of runs where a failure occurs and these apparent mount issues are not in play ... though admittedly, it is entirely conceivable that these were present in the older runs and I just missed them / wasn't cognizant enough earlier on to look for them

b) the phenomenon of oc logs / oc start-build --follow not returning is still, in theory, pending ... minimally, as a fallback, I'm thinking we switch the oc start-build --follow to oc start-build --wait in our jee jenkins pipeline sample

@gabemontero (Contributor)

FYI - broke out the log following thread here to separate issue #14148

@gabemontero (Contributor)

The latest findings in #14013 are pointing toward a solution for this issue.

If the trend in the PR continues for another day or so, I'll rebrand it as a fix.
