docs/user: add troubleshootingbootstrap to define the bootstrap log bundle #3506

abhinavdahiya · 2020-04-24T20:36:12Z

This adds a document that provides:

- structural information about the various files in the bootstrap log bundle
- some common failures that can be diagnosed using the bootstrap log bundle

/cc @openshift/openshift-team-installer

abhinavdahiya · 2020-04-26T17:00:17Z

/test

openshift-ci-robot · 2020-04-26T17:00:31Z

@abhinavdahiya: The /test command needs one or more targets.
The following commands are available to trigger jobs:

/test e2e-aws
/test e2e-aws-disruptive
/test e2e-aws-fips
/test e2e-aws-proxy
/test e2e-aws-rhel8
/test e2e-aws-scaleup-rhel7
/test e2e-aws-shared-vpc
/test e2e-aws-upgrade
/test e2e-aws-upi
/test e2e-azure
/test e2e-azure-shared-vpc
/test e2e-azure-upi
/test e2e-gcp
/test e2e-gcp-shared-vpc
/test e2e-gcp-upgrade
/test e2e-gcp-upi
/test e2e-libvirt
/test e2e-metal
/test e2e-metal-ipi
/test e2e-openstack
/test e2e-openstack-parallel
/test e2e-ovirt
/test e2e-vsphere
/test e2e-vsphere-upi
/test gofmt
/test golint
/test govet
/test images
/test shellcheck
/test tf-fmt
/test tf-lint
/test unit
/test verify-vendor
/test yaml-lint

Use /test all to run all jobs.

In response to this:

/test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

abhinavdahiya · 2020-04-26T17:00:34Z

/test all

wking · 2020-04-26T20:39:42Z

/refresh
/retest
/test unit

wking · 2020-04-26T20:41:09Z

/test yaml-lint
/test verify-vendor
/test tf-lint
/test shellcheck
/test images
/test govet
/test golint
/test gofmt
/test e2e-aws-upgrade

wking · 2020-04-26T20:49:23Z

docs/user/troubleshooting.md

@@ -77,6 +77,8 @@ The most important thing to look at on the bootstrap node is `bootkube.service`.
 1. If SSH is available, the following command can be run on the bootstrap node: `journalctl --unit=bootkube.service`
 2. Regardless of whether or not SSH is available, the following command can be run: `curl --insecure --cert ${INSTALL_DIR}/tls/journal-gatewayd.crt --key ${INSTALL_DIR}/tls/journal-gatewayd.key 'https://${BOOTSTRAP_IP}:19531/entries?follow&_SYSTEMD_UNIT=bootkube.service'`

+The installer can also gather a log bundle from the bootstrap host using SSH as describe in [troubleshootingbootstrap][./troubleshootingbootstap.md] document.


nit: Maybe `as described [here](troubleshootingbootstrap.md).`? But regardless of what you use as the link text, the URI should go in parens, because you have an inline link, not a reference-style link.
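
To illustrate the distinction raised in this nit: in Markdown, `[text](uri)` is an inline link, while `[text][label]` looks up a separately defined reference label. The following is a minimal sketch, not part of the PR, of how such misuse could be detected and rewritten. The function name `fix_inline_links` and the heuristic (a second bracket pair whose content starts with `./` or `#` is treated as a URI) are assumptions made for illustration only.

```python
import re

# Matches "[text][target]" where the target starts with "./" or "#", i.e. it
# looks like a URI rather than a reference label. This heuristic is an
# assumption for illustration, not a rule from any Markdown specification.
MISUSED_REFERENCE_LINK = re.compile(r"\[([^\]]+)\]\[((?:\./|#)[^\]]*)\]")


def fix_inline_links(markdown: str) -> str:
    """Rewrite "[text][uri]" as the inline form "[text](uri)"."""
    return MISUSED_REFERENCE_LINK.sub(r"[\1](\2)", markdown)


if __name__ == "__main__":
    line = "each failed systemd unit from [failed-units][#file-failed-units-txt]"
    print(fix_inline_links(line))
```

Genuine reference-style links such as `[text][label]` are left untouched, because their label does not start with `./` or `#`.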

wking · 2020-04-26T20:53:46Z

docs/user/troubleshootingbootstrap.md

+
+1. Use the user's already setup `SSH_AGENT`. If the user has a ssh-agent setup, the installer will use it for SSH authentication.
+
+2. Use the user'd home directory, `~/.ssh` on linux hosts, to load all the SSH private keys and use those for SSH authentication.


nit: "linux" -> "Linux"

wking · 2020-04-26T20:57:51Z

docs/user/troubleshootingbootstrap.md

+
+### directory: unit-status
+
+The unit-status directory contains the details of each failed systemd unit from [failed-units][#file-failed-units-txt]


nit: `[failed-units](#file-failed-units-txt).` (square brackets -> parens, and a trailing period).

wking · 2020-04-26T20:58:53Z

docs/user/troubleshootingbootstrap.md

+
+### directory: bootstrap
+
+The bootstrap directory consists of all the important logs and files from the bootstrap host. There are 2 sub directories for the bootstrap host


nit: "sub directories" -> "subdirectories". You may also want a trailing colon (`:`).

Also, the example has three subdirectories, not two. Maybe just say "The subdirectories are:"?

wking · 2020-04-26T21:02:05Z

docs/user/troubleshootingbootstrap.md

+* `crio-configure.log` and `crio.log`, these units are responsible for configuring the CRI-O on the bootstrap host and CRI-O daemon respectively.
+* `kubelet.log`, the kubelet service is responsible for running the kubelet on the bootstrap host. The kubelet on the bootstrap host is responsible for running the static pods for etcd, bootstrap-kube-controlplane and various other operators in bootstrap mode.
+* `approve-csr.log`, the approve-csr unit is responsible for allowing control-plane machines to join OpenShift cluster. This unit performs the job of in-cluster approver while the bootstrapping is in progress.
+


nit: your previous list entries had no intervening blank lines; probably drop this one for consistency.

wking · 2020-04-26T21:04:46Z

docs/user/troubleshootingbootstrap.md

+12 directories, 3 files
+```
+
+#### directory: control-plane/*/containers


nit: Markdown (at least GitHub's PR/files renderer) thinks this `*` is the beginning of an italics span. You can backtick your paths, like `control-plane/*/containers`, to avoid confusing it.

wking · 2020-04-26T21:05:20Z

docs/user/troubleshootingbootstrap.md

+
+#### directory: control-plane/*/containers
+
+The containers directory contains the descriptions and logs from all the containers created by the kubelet using CRIO on the control-plane host. The files are same as [containers directory][#directory-bootstrap-containers] on bootstrap host.


nit: another square brackets -> parens inline link. Also "CRIO" -> "CRI-O".

wking · 2020-04-26T21:07:23Z

docs/user/troubleshootingbootstrap.md

+* `kubelet.log`
+* `machine-config-daemon-host.log` and `pivot.log`, these files have logs for RHCOS pivot related actions on the control plane host.
+
+## Common Failures


I would still like the installer to grow diagnostics for common failures (#2569). Any thoughts about whether we can get an up/down decision on that direction once 4.6 splits off from master?

wking · 2020-04-26T21:08:47Z

docs/user/troubleshootingbootstrap.md

+-- No entries --
+```
+
+There is high likelyhood that the Release Image cannot be downloaded and more details can be found using [release-image.log][#unable-to-pull-release-image]


nit: another square brackets -> parens inline link.

abhinavdahiya · 2020-04-27T16:52:10Z

/assign @jstuever

jstuever

Some minor changes requested. Also, I'd like to see an actual troubleshooting workflow of some sort... such as 1) confirm images are downloading, 2) confirm etcd is up... etc... IMHO that is the most value we can add here.

jstuever · 2020-04-27T21:35:09Z

docs/user/troubleshooting.md

@@ -77,6 +77,8 @@ The most important thing to look at on the bootstrap node is `bootkube.service`.
 1. If SSH is available, the following command can be run on the bootstrap node: `journalctl --unit=bootkube.service`
 2. Regardless of whether or not SSH is available, the following command can be run: `curl --insecure --cert ${INSTALL_DIR}/tls/journal-gatewayd.crt --key ${INSTALL_DIR}/tls/journal-gatewayd.key 'https://${BOOTSTRAP_IP}:19531/entries?follow&_SYSTEMD_UNIT=bootkube.service'`

+The installer can also gather a log bundle from the bootstrap host using SSH as describe in [troubleshootingbootstrap][./troubleshootingbootstap.md] document.


I like showing the filename to the user... but the syntax needs to be fixed or it doesn't work.

jstuever · 2020-04-27T21:37:25Z

docs/user/troubleshootingbootstrap.md

+
+The installer will use the user's environment to discover the credentials to connect to the bootstrap host over SSH. One of the following methods is used by the installer,
+
+1. Use the user's already setup `SSH_AGENT`. If the user has a ssh-agent setup, the installer will use it for SSH authentication.


Use an already setup...

jstuever · 2020-04-27T21:38:46Z

docs/user/troubleshootingbootstrap.md

+
+1. Use the user's already setup `SSH_AGENT`. If the user has a ssh-agent setup, the installer will use it for SSH authentication.
+
+2. Use the user'd home directory, `~/.ssh` on Linux hosts, to load all the SSH private keys and use those for SSH authentication.


"user'd" -> "user's"
Also, clarify that this only happens if the `SSH_AGENT` isn't already running.

> Also, clarify that this only happens if the `SSH_AGENT` isn't already running.

https://github.com/openshift/installer/pull/3506/files#diff-135e3d860b56722d4c6282c25380d24dR13 already says `One of`

jstuever · 2020-04-27T21:40:11Z

docs/user/troubleshootingbootstrap.md

+1. Use the user's already setup `SSH_AGENT`. If the user has a ssh-agent setup, the installer will use it for SSH authentication.
+
+2. Use the user'd home directory, `~/.ssh` on Linux hosts, to load all the SSH private keys and use those for SSH authentication.
+    a. The installer also configures the bootstrap host with a *generated* SSH key, and this private key will be used for SSH authentication none of the user keys are trusted.


The placement feels odd... should this be 3.?

This is not 3; it is only valid when the installer is discovering keys. If `SSH_AGENT` is set, we don't do any discovering.
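
The ordering discussed in this thread can be sketched as follows. This is an illustration only, not the installer's implementation: the function name `ssh_auth_candidates` and its parameters are hypothetical, and it assumes that a running agent is detected through the `SSH_AUTH_SOCK` environment variable that `ssh-agent` exports (the document under review refers to this as `SSH_AGENT`).

```python
from typing import List, Mapping, Optional


def ssh_auth_candidates(
    env: Mapping[str, str],
    home_keys: List[str],
    generated_key: Optional[str],
) -> List[str]:
    """Return SSH auth sources in the order described in the review thread."""
    # Assumption: a configured ssh-agent is detected via SSH_AUTH_SOCK.
    # When an agent is present, no key discovery happens at all.
    if env.get("SSH_AUTH_SOCK"):
        return ["agent"]
    # No agent: discover the user's keys from ~/.ssh, then fall back to the
    # installer-generated key (item 2.a), which only applies on this path.
    candidates = list(home_keys)
    if generated_key is not None:
        candidates.append(generated_key)
    return candidates


if __name__ == "__main__":
    print(ssh_auth_candidates({}, ["~/.ssh/id_rsa"], "generated-key"))
```

The sketch makes the point of the reply above concrete: the generated key is a sub-case of key discovery, not a third top-level method.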

jstuever · 2020-04-27T21:41:30Z

docs/user/troubleshootingbootstrap.md

+
+When users are using the installer to create the OpenShift cluster, the installer has all the information to automatically capture the logs from bootstrap host in case of failure.
+
+#### Authenticating with bootstrap host for ipi


Authenticating to the bootstrap host

abhinavdahiya · 2020-04-27T22:05:21Z

> Also, I'd like to see an actual troubleshooting workflow of some sort... such as 1) confirm images are downloading, 2) confirm etcd is up... etc... IMHO that is the most value we can add here.

@jstuever The goal is to tell people whether the failure is due to one of these reasons, so users can see which one applies to them.

A workflow of what you should look at just isn't possible, because there are too many moving parts, and people respond better to symptoms than to a path.

"My bootstrap failed; was it because the control-plane machines didn't join?" That is easier to link to and define, instead of, hey let's go on a ride of flow-chart.

openshift-ci-robot · 2020-04-28T06:28:26Z

@abhinavdahiya: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/yaml-lint | fdfe31b00ff1d8caecc73b14b7f9d109361c6cc9 | link | `/test yaml-lint` |
| ci/prow/golint | fdfe31b00ff1d8caecc73b14b7f9d109361c6cc9 | link | `/test golint` |
| ci/prow/gofmt | fdfe31b00ff1d8caecc73b14b7f9d109361c6cc9 | link | `/test gofmt` |
| ci/prow/govet | fdfe31b00ff1d8caecc73b14b7f9d109361c6cc9 | link | `/test govet` |
| ci/prow/verify-vendor | fdfe31b00ff1d8caecc73b14b7f9d109361c6cc9 | link | `/test verify-vendor` |
| ci/prow/unit | fdfe31b00ff1d8caecc73b14b7f9d109361c6cc9 | link | `/test unit` |
| ci/prow/images | fdfe31b00ff1d8caecc73b14b7f9d109361c6cc9 | link | `/test images` |
| ci/prow/e2e-ovirt | fdfe31b00ff1d8caecc73b14b7f9d109361c6cc9 | link | `/test e2e-ovirt` |
| ci/prow/e2e-aws-scaleup-rhel7 | fdfe31b00ff1d8caecc73b14b7f9d109361c6cc9 | link | `/test e2e-aws-scaleup-rhel7` |
| ci/prow/e2e-metal-ipi | fdfe31b00ff1d8caecc73b14b7f9d109361c6cc9 | link | `/test e2e-metal-ipi` |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

abhinavdahiya · 2020-05-01T16:15:02Z

ping @jstuever for review

patrickdillon

left some proofreading suggestions

patrickdillon · 2020-05-05T11:16:14Z

docs/user/troubleshooting.md

@@ -77,6 +77,8 @@ The most important thing to look at on the bootstrap node is `bootkube.service`.
 1. If SSH is available, the following command can be run on the bootstrap node: `journalctl --unit=bootkube.service`
 2. Regardless of whether or not SSH is available, the following command can be run: `curl --insecure --cert ${INSTALL_DIR}/tls/journal-gatewayd.crt --key ${INSTALL_DIR}/tls/journal-gatewayd.key 'https://${BOOTSTRAP_IP}:19531/entries?follow&_SYSTEMD_UNIT=bootkube.service'`

+The installer can also gather a log bundle from the bootstrap host using SSH as describe in [troubleshootingbootstrap](./troubleshootingbootstap.md) document.


Link does not work.

`as describe in [troubleshootingbootstrap]` -> `as described in the [troubleshooting bootstrap]`

abhinavdahiya · 2020-05-05T16:29:30Z

@patrickdillon Thanks for the review, updated the PR! :)

jstuever · 2020-05-06T15:27:31Z

> hey let's go on a ride of flow-chart.

I was thinking more of a high-level flow chart (pre-installation, wait-for bootstrap, wait-for install) to help direct what the user should be doing to troubleshoot and to narrow down which errors might apply to them. However, in hindsight, this is likely beyond the scope of this particular story.

jstuever · 2020-05-06T15:28:04Z

/lgtm

jstuever · 2020-05-06T15:28:47Z

/retest

abhinavdahiya · 2020-05-06T18:21:42Z

/approve

openshift-ci-robot · 2020-05-06T18:22:01Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [abhinavdahiya]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot requested a review from a team April 24, 2020 20:36

abhinavdahiya added the bugzilla/valid-bug (indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting) and retest-not-required-docs-only labels Apr 24, 2020

abhinavdahiya closed this Apr 24, 2020

abhinavdahiya reopened this Apr 24, 2020

wking reviewed Apr 26, 2020

wking reviewed Apr 26, 2020

abhinavdahiya force-pushed the bg_doc branch from 80924a6 to a520530 April 27, 2020 16:51

openshift-ci-robot assigned jstuever Apr 27, 2020

jstuever requested changes Apr 27, 2020

abhinavdahiya force-pushed the bg_doc branch 2 times, most recently from a520530 to 120544e April 28, 2020 05:49

patrickdillon reviewed May 5, 2020

abhinavdahiya removed the retest-not-required-docs-only label May 5, 2020

abhinavdahiya closed this May 5, 2020

abhinavdahiya reopened this May 5, 2020

docs/user: add troubleshootingbootstrap to define the bootstrap log b…

56f0d24

…undle

abhinavdahiya force-pushed the bg_doc branch from b9f8f22 to 56f0d24 May 5, 2020 16:36

abhinavdahiya added the retest-not-required-docs-only label May 5, 2020

openshift-ci-robot added the lgtm label (indicates that a PR is ready to be merged) May 6, 2020

openshift-ci-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) May 6, 2020

openshift-merge-robot merged commit 51bbaa4 into openshift:master May 6, 2020

abhinavdahiya deleted the bg_doc branch May 6, 2020 18:26

