
cluster-launch-installer-e2e: Pull all logs in parallel #1817

Merged

Conversation

@wking (Member) commented Oct 4, 2018

An alternative to #1815, this approach keeps up to 10 pulls going at once.

These are currently generating a lot of error messages. From recent openshift/installer#415 tests:

$ oc project ci-op-w11cl72x
$ oc logs e2e-aws -c teardown --timestamps
2018-10-04T18:17:06.557740109Z Gathering artifacts ...
2018-10-04T18:17:24.875374828Z Error from server (Forbidden): Forbidden (user=kube-apiserver, verb=get, resource=nodes, subresource=log)
...
2018-10-04T18:17:29.331684772Z Error from server (Forbidden): Forbidden (user=kube-apiserver, verb=get, resource=nodes, subresource=log)
2018-10-04T18:17:29.351919855Z Error from server (NotFound): the server could not find the requested resource
2018-10-04T18:17:39.592948165Z Error from server (BadRequest): previous terminated container "registry" in pod "registry-b6df966cf-fkhpl" not found
...
2018-10-04T18:29:24.457841097Z Error from server (BadRequest): previous terminated container "kube-addon-operator" in pod "kube-addon-operator-775d4c8f8d-289zm" not found
2018-10-04T18:29:24.466213055Z Waiting for node logs to finish ...
2018-10-04T18:29:24.466289887Z Deprovisioning cluster ...
2018-10-04T18:29:24.483065903Z level=debug msg="Deleting security groups"
...
2018-10-04T18:33:29.857465158Z level=debug msg="goroutine deleteVPCs complete"

So 12 minutes to pull the logs, followed by four minutes for destroy-cluster.

Looking at the extracted logs, lots of them are zero-length (an empty log compresses to 20 bytes):

$ POD_LOGS="$(w3m -dump https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/openshift_installer/415/pull-ci-openshift-installer-master-e2e-aws/456/artifacts/e2e-aws/pods/)"
$ echo "${POD_LOGS}" | grep '^ *20$' | wc -l
86
$ echo "${POD_LOGS}" | grep '\[file\]' | wc -l
172
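
(For reference, 20 bytes is exactly what gzip produces for empty input: a 10-byte header, a 2-byte empty deflate block, and an 8-byte CRC/size trailer. You can confirm that locally with:

  $ printf '' | gzip | wc -c
  20

so those 86 entries really are empty logs.)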

So it's possible that the delay is due to the errors, or to a few large logs blocking the old, serial pod/container pulls.

With this pull request, I've added a new queue command. It checks how many background jobs we have (via jobs) and idles until we drop below 10, then launches its particular command in the background. By using queue, we'll keep up to 10 log fetches running in parallel, and the final wait will block for any which still happen to be running by that point.

The previous gzip invocations used -c, which dates back to 82d333e (#867). But with these gzip filters running on stdin anyway, the -c was superfluous. I've dropped it in this pull request.
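
To illustrate the point with a generic pipeline (not the template's exact commands): when gzip has no file operands and its output is redirected, it already reads stdin and writes stdout, so these two forms behave identically:

  $ some-command | gzip -c > output.gz
  $ some-command | gzip > output.gz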

Moving the redirect target to a positional argument is a bit kludgy. I'd rather have a more familiar way of phrasing that redirect, but passing it in as ${1} was the best I've come up with.
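
For anyone skimming, the helper's shape is roughly the following. This is a sketch approximating the behavior described above, not a copy of the template code, and details may differ:

  function queue() {
    local TARGET="${1}"  # redirect target passed as the first positional argument
    shift
    local LIVE
    LIVE="$(jobs | wc -l)"
    while [[ "${LIVE}" -ge 10 ]]; do
      sleep 1
      LIVE="$(jobs | wc -l)"
    done
    echo "${@}"  # log the command we're about to launch
    if [[ -n "${FILTER:-}" ]]; then
      "${@}" | "${FILTER}" > "${TARGET}" &
    else
      "${@}" > "${TARGET}" &
    fi
  }

A call then looks like FILTER=gzip queue /tmp/artifacts/nodes/$i/messages.gz oc get --request-timeout=20s --raw /api/v1/nodes/$i/proxy/logs/messages, and the final wait at the end of artifact gathering collects anything still running.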

@openshift-ci-robot added the size/M and approved labels on Oct 4, 2018
@wking force-pushed the parallel-log-retrieval branch from 8db1c76 to 13c41e7 on October 4, 2018 21:46
@wking (Member, Author) commented Oct 4, 2018

I grabbed the kubeconfig via the OpenShift console and ran this script successfully against ci-op-3dphr9zn.

sleep 1
LIVE="$(jobs | wc -l)"
done
echo "${@}"
wking (Member, Author), on the echo "${@}" line:

This seemed interesting, but it makes the logs a bit noisier than they used to be. I'm fine dropping the echo if folks want quieter logs for these jobs.

# gather nodes first in parallel since they may contain the most relevant debugging info
while IFS= read -r i; do
mkdir -p /tmp/artifacts/nodes/$i
(
oc get --request-timeout=20s --raw /api/v1/nodes/$i/proxy/logs/messages | gzip -c > /tmp/artifacts/nodes/$i/messages.gz
FILTER=gzip queue /tmp/artifacts/nodes/$i/messages.gz oc get --request-timeout=20s --raw /api/v1/nodes/$i/proxy/logs/messages
oc get --request-timeout=20s --raw /api/v1/nodes/$i/proxy/logs/journal | sed -e 's|.*href="\(.*\)".*|\1|;t;d' > /tmp/journals
A reviewer (Contributor):

Why no queue here?

wking (Member, Author) replied:

Because we need /tmp/journals populated synchronously to feed the following while loop.
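
Roughly, the ordering is: fetch the journal index in the foreground, then queue the per-journal pulls. A simplified sketch rather than the exact template code (the output filename and the system.journal path inside the loop are illustrative assumptions):

  # The journal listing must finish before the loop can start, so no queue here.
  oc get --request-timeout=20s --raw "/api/v1/nodes/${i}/proxy/logs/journal" \
    | sed -e 's|.*href="\(.*\)".*|\1|;t;d' > /tmp/journals

  # The per-journal fetches can then go through queue and run in parallel.
  while IFS= read -r j; do
    j="${j%/}"  # the directory listing may include a trailing slash
    FILTER=gzip queue "/tmp/artifacts/nodes/${i}/${j}-system.journal.gz" \
      oc get --request-timeout=20s --raw "/api/v1/nodes/${i}/proxy/logs/journal/${j}/system.journal"
  done < /tmp/journals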

@abhinavdahiya (Contributor) left a comment:

/lgtm

@openshift-ci-robot added the lgtm label on Oct 4, 2018
@openshift-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot merged commit 947d526 into openshift:master on Oct 4, 2018
@openshift-ci-robot (Contributor):

@wking: Updated the prow-job-cluster-launch-installer-e2e configmap using the following files:

  • key cluster-launch-installer-e2e.yaml using file ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking (Member, Author) commented Oct 5, 2018

With the new script:

2018/10/05 00:25:04 Container test in pod e2e-aws failed, exit code 1, reason Error
2018/10/05 00:33:46 Container teardown in pod e2e-aws completed successfully

So 9 minutes vs. the previous 12, and we still have logs. Not as dramatic an improvement as I'd hoped, and this run failed, so the logs might be shorter than the previous run's anyway. Maybe we can tune the number of parallel pulls, but without more details on bandwidth and such, I'd just be guessing.

Edit: actually, 9 minutes vs. the previous 16. That's not so bad ;).
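
If we do end up tuning it, one hypothetical approach (MAX_PULLS is a made-up name, not something in the current template) would be to swap the hard-coded 10 inside queue for an overridable default:

  # Hypothetical: let the job configuration override the pull parallelism.
  MAX_PULLS="${MAX_PULLS:-10}"
  while [[ "$(jobs | wc -l)" -ge "${MAX_PULLS}" ]]; do
    sleep 1
  done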

@wking (Member, Author) commented Oct 5, 2018

We may want to look into the history a bit more closely, because it looks like the recent speed-up is just getting us back to where we were at the beginning of the week. Here's a plot of e2e-aws duration over time for the past two weeks:

[chart: builds-early-october]

You can see that we come back from openshift/installer#415 much slower than we were before. We switched over to the CVO at 2018-10-02T16:25Z, so there were a few fast runs after that before the outage. I forget exactly what kicked that outage off; I think there were a number of other fixes besides openshift/installer#415. Anyhow, I'll poke around in the logs and at least see which stage (setup, test, or teardown) got slower before this PR landed.

For kicks, here's a chart with e2e-aws durations going back to our first success.

[chart: builds]

So the general trend is moving in the right direction ;). I've pushed the scripts and collected data up here if folks are interested.

@wking deleted the parallel-log-retrieval branch on October 5, 2018 19:19
@wking (Member, Author) commented Oct 5, 2018

From a recent run (which includes #1831):

$ oc project ci-op-k2n3b7rw
$ oc logs e2e-aws -c teardown --timestamps
2018-10-05T20:54:31.215046482Z Gathering artifacts ...
2018-10-05T20:54:42.44196836Z oc --request-timeout=5s get nodes -o json
2018-10-05T20:54:42.444342909Z oc --request-timeout=5s get events --all-namespaces -o json
2018-10-05T20:54:42.448649504Z oc get --request-timeout=20s --raw /api/v1/nodes/ip-10-0-0-117.ec2.internal/proxy/logs/messages
2018-10-05T20:54:43.29532312Z Error from server (NotFound): the server could not find the requested resource
2018-10-05T20:54:43.403350622Z oc get --request-timeout=20s --raw /api/v1/nodes/ip-10-0-0-117.ec2.internal/proxy/logs/journal/b2b8c626002e4c35a8b0c660f81b0f8c/system.journal
2018-10-05T20:54:43.405958642Z oc get --request-timeout=20s --raw /api/v1/nodes/ip-10-0-0-117.ec2.internal/proxy/metrics
...
2018-10-05T20:55:23.184746943Z Error from server (BadRequest): container "package-server" in pod "package-server-f4d5bd459-7hdpg" is waiting to start: CreateContainerError
2018-10-05T20:55:23.18866575Z Error from server (BadRequest): previous terminated container "webconsole" in pod "webconsole-6698d4fbbc-674p8" not found
2018-10-05T20:55:23.323638427Z Deprovisioning cluster ...
...

So 52 seconds :).

@abhinavdahiya (Contributor):

@wking this is super slick 👍 we should definitely put these scripts in the installer repo or this release repo, like a contrib directory.
