
cluster-launch-installer-e2e: Pull all logs in parallel #1817

Merged

Conversation

@wking (Member) commented Oct 4, 2018

An alternative to #1815, this approach keeps up to 10 pulls going at once.

These are currently generating a lot of error messages. From recent openshift/installer#415 tests:

$ oc project ci-op-w11cl72x
$ oc logs e2e-aws -c teardown --timestamps
2018-10-04T18:17:06.557740109Z Gathering artifacts ...
2018-10-04T18:17:24.875374828Z Error from server (Forbidden): Forbidden (user=kube-apiserver, verb=get, resource=nodes, subresource=log)
...
2018-10-04T18:17:29.331684772Z Error from server (Forbidden): Forbidden (user=kube-apiserver, verb=get, resource=nodes, subresource=log)
2018-10-04T18:17:29.351919855Z Error from server (NotFound): the server could not find the requested resource
2018-10-04T18:17:39.592948165Z Error from server (BadRequest): previous terminated container "registry" in pod "registry-b6df966cf-fkhpl" not found
...
2018-10-04T18:29:24.457841097Z Error from server (BadRequest): previous terminated container "kube-addon-operator" in pod "kube-addon-operator-775d4c8f8d-289zm" not found
2018-10-04T18:29:24.466213055Z Waiting for node logs to finish ...
2018-10-04T18:29:24.466289887Z Deprovisioning cluster ...
2018-10-04T18:29:24.483065903Z level=debug msg="Deleting security groups"
...
2018-10-04T18:33:29.857465158Z level=debug msg="goroutine deleteVPCs complete"

So 12 minutes to pull the logs, followed by four minutes for destroy-cluster.

Looking at the extracted logs, lots of them are zero-length (an empty log compresses to 20 bytes):

$ POD_LOGS="$(w3m -dump https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/openshift_installer/415/pull-ci-openshift-installer-master-e2e-aws/456/artifacts/e2e-aws/pods/)"
$ echo "${POD_LOGS}" | grep '^ *20$' | wc -l
86
$ echo "${POD_LOGS}" | grep '\[file\]' | wc -l
172
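
(For reference, 20 bytes is exactly what gzip produces for empty input: a 10-byte header, a 2-byte empty deflate block, and an 8-byte CRC/size trailer. You can confirm that locally with:

  $ printf '' | gzip | wc -c
  20

so those 86 entries really are empty logs.)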

So it's possible that the delay is due to the errors, or to a few large logs blocking the old, serial pod/container pulls.

With this pull request, I've added a new queue command. It checks how many background jobs we have (via jobs) and idles until we drop below 10, then launches its particular command in the background. By using queue, we'll keep up to 10 log fetches running in parallel, and the final wait will block for any which still happen to be running by that point.

The previous gzip invocations used -c, which dates back to 82d333e (#867). But with these gzip filters running on stdin anyway, the -c was superfluous. I've dropped it in this pull request.
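
To illustrate the point with a generic pipeline (not the template's exact commands): when gzip has no file operands and its output is redirected, it already reads stdin and writes stdout, so these two forms behave identically:

  $ some-command | gzip -c > output.gz
  $ some-command | gzip > output.gz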

Moving the redirect target to a positional argument is a bit kludgy. I'd rather have a more familiar way of phrasing that redirect, but passing it in as ${1} was the best I've come up with.
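
For anyone skimming, the helper's shape is roughly the following. This is a sketch approximating the behavior described above, not a copy of the template code, and details may differ:

  function queue() {
    local TARGET="${1}"  # redirect target passed as the first positional argument
    shift
    local LIVE
    LIVE="$(jobs | wc -l)"
    while [[ "${LIVE}" -ge 10 ]]; do
      sleep 1
      LIVE="$(jobs | wc -l)"
    done
    echo "${@}"  # log the command we're about to launch
    if [[ -n "${FILTER:-}" ]]; then
      "${@}" | "${FILTER}" > "${TARGET}" &
    else
      "${@}" > "${TARGET}" &
    fi
  }

A call then looks like FILTER=gzip queue /tmp/artifacts/nodes/$i/messages.gz oc get --request-timeout=20s --raw /api/v1/nodes/$i/proxy/logs/messages, and the final wait at the end of artifact gathering collects anything still running.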

@openshift-ci-robot added the size/M and approved labels on Oct 4, 2018
@wking force-pushed the parallel-log-retrieval branch from 8db1c76 to 13c41e7 on October 4, 2018 21:46
@wking (Member, Author) commented Oct 4, 2018

I grabbed the kubeconfig via the OpenShift console and ran this script successfully against ci-op-3dphr9zn.

sleep 1
LIVE="$(jobs | wc -l)"
done
echo "${@}"
wking (Member, Author), on the echo "${@}" line:

This seemed interesting, but it makes the logs a bit noisier than they used to be. I'm fine dropping the echo if folks want quieter logs for these jobs.

# gather nodes first in parallel since they may contain the most relevant debugging info
while IFS= read -r i; do
mkdir -p /tmp/artifacts/nodes/$i
(
oc get --request-timeout=20s --raw /api/v1/nodes/$i/proxy/logs/messages | gzip -c > /tmp/artifacts/nodes/$i/messages.gz
FILTER=gzip queue /tmp/artifacts/nodes/$i/messages.gz oc get --request-timeout=20s --raw /api/v1/nodes/$i/proxy/logs/messages
oc get --request-timeout=20s --raw /api/v1/nodes/$i/proxy/logs/journal | sed -e 's|.*href="\(.*\)".*|\1|;t;d' > /tmp/journals
A reviewer (Contributor):

Why no queue here?

wking (Member, Author) replied:

Because we need /tmp/journals populated synchronously to feed the following while loop.
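
Roughly, the ordering is: fetch the journal index in the foreground, then queue the per-journal pulls. A simplified sketch rather than the exact template code (the output filename and the system.journal path inside the loop are illustrative assumptions):

  # The journal listing must finish before the loop can start, so no queue here.
  oc get --request-timeout=20s --raw "/api/v1/nodes/${i}/proxy/logs/journal" \
    | sed -e 's|.*href="\(.*\)".*|\1|;t;d' > /tmp/journals

  # The per-journal fetches can then go through queue and run in parallel.
  while IFS= read -r j; do
    j="${j%/}"  # the directory listing may include a trailing slash
    FILTER=gzip queue "/tmp/artifacts/nodes/${i}/${j}-system.journal.gz" \
      oc get --request-timeout=20s --raw "/api/v1/nodes/${i}/proxy/logs/journal/${j}/system.journal"
  done < /tmp/journals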

@abhinavdahiya (Contributor) left a comment:

/lgtm

@openshift-ci-robot added the lgtm label on Oct 4, 2018
@openshift-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot merged commit 947d526 into openshift:master on Oct 4, 2018
@openshift-ci-robot (Contributor):

@wking: Updated the prow-job-cluster-launch-installer-e2e configmap using the following files:

  • key cluster-launch-installer-e2e.yaml using file ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking (Member, Author) commented Oct 5, 2018

With the new script:

2018/10/05 00:25:04 Container test in pod e2e-aws failed, exit code 1, reason Error
2018/10/05 00:33:46 Container teardown in pod e2e-aws completed successfully

So 9 minutes vs. the previous 12, and we still have logs. Not as dramatic an improvement as I'd hoped, and this run failed, so the logs might be shorter than the previous run's anyway. Maybe we can tune the number of parallel pulls, but without more details on bandwidth and such, I'd just be guessing.

Edit: actually, 9 minutes vs. the previous 16. That's not so bad ;).
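
If we do end up tuning it, one hypothetical approach (MAX_PULLS is a made-up name, not something in the current template) would be to swap the hard-coded 10 inside queue for an overridable default:

  # Hypothetical: let the job configuration override the pull parallelism.
  MAX_PULLS="${MAX_PULLS:-10}"
  while [[ "$(jobs | wc -l)" -ge "${MAX_PULLS}" ]]; do
    sleep 1
  done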

@wking (Member, Author) commented Oct 5, 2018

We may want to look into the history a bit more closely, because it looks like the recent speed-up is just getting us back to where we were at the beginning of the week. Here's a plot of e2e-aws duration over time for the past two weeks:

[chart: builds-early-october]

You can see that we come back from openshift/installer#415 much slower than we were before. We switched over to the CVO at 2018-10-02T16:25Z, so there were a few fast runs after that before the outage. I forget exactly what kicked that outage off; I think there were a number of other fixes besides openshift/installer#415. Anyhow, I'll poke around in the logs and at least see which stage (setup, test, or teardown) got slower before this PR landed.

For kicks, here's a chart with e2e-aws durations going back to our first success.

[chart: builds]

So the general trend is moving in the right direction ;). I've pushed the scripts and collected data up here if folks are interested.

@wking deleted the parallel-log-retrieval branch on October 5, 2018 19:19
@wking (Member, Author) commented Oct 5, 2018

From a recent run (which includes #1831):

$ oc project ci-op-k2n3b7rw
$ oc logs e2e-aws -c teardown --timestamps
2018-10-05T20:54:31.215046482Z Gathering artifacts ...
2018-10-05T20:54:42.44196836Z oc --request-timeout=5s get nodes -o json
2018-10-05T20:54:42.444342909Z oc --request-timeout=5s get events --all-namespaces -o json
2018-10-05T20:54:42.448649504Z oc get --request-timeout=20s --raw /api/v1/nodes/ip-10-0-0-117.ec2.internal/proxy/logs/messages
2018-10-05T20:54:43.29532312Z Error from server (NotFound): the server could not find the requested resource
2018-10-05T20:54:43.403350622Z oc get --request-timeout=20s --raw /api/v1/nodes/ip-10-0-0-117.ec2.internal/proxy/logs/journal/b2b8c626002e4c35a8b0c660f81b0f8c/system.journal
2018-10-05T20:54:43.405958642Z oc get --request-timeout=20s --raw /api/v1/nodes/ip-10-0-0-117.ec2.internal/proxy/metrics
...
2018-10-05T20:55:23.184746943Z Error from server (BadRequest): container "package-server" in pod "package-server-f4d5bd459-7hdpg" is waiting to start: CreateContainerError
2018-10-05T20:55:23.18866575Z Error from server (BadRequest): previous terminated container "webconsole" in pod "webconsole-6698d4fbbc-674p8" not found
2018-10-05T20:55:23.323638427Z Deprovisioning cluster ...
...

So 52 seconds :).

@abhinavdahiya (Contributor):

@wking this is super slick 👍 we should definitely put these scripts in the installer repo or this release repo, like a contrib directory.
