cluster-launch-installer-e2e: Pull all logs in parallel #1817
Conversation
An alternative to #1815, this approach keeps up to 10 pulls going at once.

These pulls are currently generating a lot of error messages. From recent openshift/installer#415 tests [1]:

```console
$ oc project ci-op-w11cl72x
$ oc logs e2e-aws -c teardown --timestamps
2018-10-04T18:17:06.557740109Z Gathering artifacts ...
2018-10-04T18:17:24.875374828Z Error from server (Forbidden): Forbidden (user=kube-apiserver, verb=get, resource=nodes, subresource=log)
...
2018-10-04T18:17:29.331684772Z Error from server (Forbidden): Forbidden (user=kube-apiserver, verb=get, resource=nodes, subresource=log)
2018-10-04T18:17:29.351919855Z Error from server (NotFound): the server could not find the requested resource
2018-10-04T18:17:39.592948165Z Error from server (BadRequest): previous terminated container "registry" in pod "registry-b6df966cf-fkhpl" not found
...
2018-10-04T18:29:24.457841097Z Error from server (BadRequest): previous terminated container "kube-addon-operator" in pod "kube-addon-operator-775d4c8f8d-289zm" not found
2018-10-04T18:29:24.466213055Z Waiting for node logs to finish ...
2018-10-04T18:29:24.466289887Z Deprovisioning cluster ...
2018-10-04T18:29:24.483065903Z level=debug msg="Deleting security groups"
...
2018-10-04T18:33:29.857465158Z level=debug msg="goroutine deleteVPCs complete"
```

So 12 minutes to pull the logs, followed by four minutes for `destroy-cluster`. Looking at the extracted logs, lots of them are zero-length (an empty stream compresses to 20 bytes):

```console
$ POD_LOGS="$(w3m -dump https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/openshift_installer/415/pull-ci-openshift-installer-master-e2e-aws/456/artifacts/e2e-aws/pods/)"
$ echo "${POD_LOGS}" | grep '^ *20$' | wc -l
86
$ echo "${POD_LOGS}" | grep '\[file\]' | wc -l
172
```

So it's possible that the delay is due to the errors, or to a few large logs blocking the old, serial pod/container pulls.

With this pull request, I've added a new `queue` command. It checks how many background jobs we have using `jobs` [2], and idles until we get below 10. Then it launches its particular command in the background. By using `queue`, we keep up to 10 log-fetches running in parallel, and the final `wait` blocks for any which still happen to be running by that point.

The previous `gzip` invocations used `-c`, which dates back to 82d333e (Set up artifact reporting for ci-operator jobs, 2018-05-17, openshift#867). But with these `gzip` filters running on stdin anyway, the `-c` was superfluous, so I've dropped it in this pull request.

Moving the redirect target to a positional argument is a bit kludgy. I'd rather have a more familiar way of phrasing that redirect, but passing it in as `${1}` was the best I've come up with.

[1]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/415/pull-ci-openshift-installer-master-e2e-aws/456/build-log.txt
[2]: http://pubs.opengroup.org/onlinepubs/9699919799/utilities/jobs.html
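For readers who don't want to dig through the diff, here is a minimal sketch of how such a `queue` helper can be written, based on the description above. The limit of 10, the `FILTER` environment variable, and the `${1}` redirect target follow the conventions visible in the diff; the exact function in the PR may differ in its details.

```sh
# Sketch of the queue helper described above; not necessarily the exact PR code.
# ${1} is the file the command's output is redirected to; the remaining
# arguments are the command itself. FILTER, if set, names a filter (e.g. gzip)
# inserted between the command and the target file.
queue() {
	local TARGET="${1}"
	shift
	local LIVE
	LIVE="$(jobs | wc -l)"
	while [[ "${LIVE}" -ge 10 ]]; do  # idle until we drop below 10 background jobs
		sleep 1
		LIVE="$(jobs | wc -l)"
	done
	echo "${@}"  # log the command we're about to launch
	if [[ -n "${FILTER:-}" ]]; then
		# gzip reads stdin and writes stdout by default, so no -c is needed here
		"${@}" | "${FILTER}" >"${TARGET}" &
	else
		"${@}" >"${TARGET}" &
	fi
}

# Typical usage: queue up fetches, then block on any stragglers at the end.
#   FILTER=gzip queue /tmp/artifacts/nodes/${i}/messages.gz \
#       oc get --request-timeout=20s --raw /api/v1/nodes/${i}/proxy/logs/messages
#   ...
#   wait
```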
Force-pushed from 8db1c76 to 13c41e7.
I grabbed the kubeconfig via the OpenShift console and ran this script successfully against ci-op-3dphr9zn.
```sh
    sleep 1
    LIVE="$(jobs | wc -l)"
  done
  echo "${@}"
```
This seemed interesting, but it makes the logs a bit noisier than they used to be. I'm fine dropping the `echo` if folks want quieter logs for these jobs.
```diff
 # gather nodes first in parallel since they may contain the most relevant debugging info
 while IFS= read -r i; do
   mkdir -p /tmp/artifacts/nodes/$i
   (
-    oc get --request-timeout=20s --raw /api/v1/nodes/$i/proxy/logs/messages | gzip -c > /tmp/artifacts/nodes/$i/messages.gz
+    FILTER=gzip queue /tmp/artifacts/nodes/$i/messages.gz oc get --request-timeout=20s --raw /api/v1/nodes/$i/proxy/logs/messages
     oc get --request-timeout=20s --raw /api/v1/nodes/$i/proxy/logs/journal | sed -e 's|.*href="\(.*\)".*|\1|;t;d' > /tmp/journals
```
Why no queue here?
Because we need `/tmp/journals` populated synchronously to feed the following `while` loop.
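For context, here's a minimal sketch of the pattern being described, reusing the `queue` helper and the names from the diff above. The per-journal target naming is a simplification for illustration, not necessarily what the PR does.

```sh
# The journal index has to be in /tmp/journals before the loop below starts,
# so this listing is fetched synchronously (no queue):
oc get --request-timeout=20s --raw "/api/v1/nodes/${i}/proxy/logs/journal" \
	| sed -e 's|.*href="\(.*\)".*|\1|;t;d' > /tmp/journals

# The individual journal fetches, by contrast, can go through queue and run
# in the background:
while IFS= read -r j; do
	FILTER=gzip queue "/tmp/artifacts/nodes/${i}/$(basename "${j}").gz" \
		oc get --request-timeout=20s --raw "/api/v1/nodes/${i}/proxy/logs/journal/${j}"
done < /tmp/journals
```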
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: abhinavdahiya, wking. The full list of commands accepted by this bot can be found here. The pull request process is described here.
So 9 minutes vs. the previous 12, and we still have logs. Not as dramatic an improvement as I'd hoped, and this run failed, so the logs might be shorter than the previous run anyway. Maybe we can tune the number of parallel pulls, but without more details on bandwidth and such, I'd just be guessing. Edit: actually, 9 minutes vs. the previous 16. That's not so bad ;).
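If we do end up wanting to tune the limit, one option would be to make it a variable instead of a hard-coded 10; a hypothetical sketch (`MAX_LOG_FETCHES` is not something this PR defines):

```sh
# Hypothetical knob for the parallel-fetch limit; the PR hard-codes 10.
MAX_LOG_FETCHES="${MAX_LOG_FETCHES:-10}"
while [[ "$(jobs | wc -l)" -ge "${MAX_LOG_FETCHES}" ]]; do
	sleep 1
done
```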
We may want to look into the history a bit more closely, because it looks like the recent speed-up is just getting us back to where we were at the beginning of the week. Here's a plot of e2e-aws duration over time for the past two weeks:

You can see that we come back from openshift/installer#415 much slower than we were before. We switched over to the CVO at 2018-10-02T16:25Z, so there were a few fast runs after that before the outage. I forget exactly what kicked that outage off; I think there were a number of other fixes besides openshift/installer#415. Anyhow, I'll poke around in the logs and at least see which stage (setup, test, or teardown) got slower before this PR landed.

For kicks, here's a chart with e2e-aws durations going back to our first success. So the general trend is moving in the right direction ;). I've pushed the scripts and collected data up here if folks are interested.
From a recent run (which includes #1831):

```console
$ oc project ci-op-k2n3b7rw
$ oc logs e2e-aws -c teardown --timestamps
2018-10-05T20:54:31.215046482Z Gathering artifacts ...
2018-10-05T20:54:42.44196836Z oc --request-timeout=5s get nodes -o json
2018-10-05T20:54:42.444342909Z oc --request-timeout=5s get events --all-namespaces -o json
2018-10-05T20:54:42.448649504Z oc get --request-timeout=20s --raw /api/v1/nodes/ip-10-0-0-117.ec2.internal/proxy/logs/messages
2018-10-05T20:54:43.29532312Z Error from server (NotFound): the server could not find the requested resource
2018-10-05T20:54:43.403350622Z oc get --request-timeout=20s --raw /api/v1/nodes/ip-10-0-0-117.ec2.internal/proxy/logs/journal/b2b8c626002e4c35a8b0c660f81b0f8c/system.journal
2018-10-05T20:54:43.405958642Z oc get --request-timeout=20s --raw /api/v1/nodes/ip-10-0-0-117.ec2.internal/proxy/metrics
...
2018-10-05T20:55:23.184746943Z Error from server (BadRequest): container "package-server" in pod "package-server-f4d5bd459-7hdpg" is waiting to start: CreateContainerError
2018-10-05T20:55:23.18866575Z Error from server (BadRequest): previous terminated container "webconsole" in pod "webconsole-6698d4fbbc-674p8" not found
2018-10-05T20:55:23.323638427Z Deprovisioning cluster ...
...
```

So 52 seconds :).
@wking this is super slick 👍 we should definitely put these scripts in the installer repo / this release repo.