must-gather
is a tool built on top of OpenShift must-gather
that provides the scripts for OpenStack control plane log and data collection.
oc adm must-gather --image=quay.io/openstack-k8s-operators/openstack-must-gather
The command above will create a local directory where logs, configs and status of the OpenStack control plane services are dumped.
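If you want the dump written to a specific location instead of an auto-generated directory, oc adm must-gather also accepts a destination directory (the path below is just an example):
oc adm must-gather --image=quay.io/openstack-k8s-operators/openstack-must-gather --dest-dir=/tmp/openstack-must-gather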
In particular, openstack-must-gather will get a dump of:
- Service logs: retrieved from the output of the pods (and operators) associated with the deployed services
- Services config: retrieved for each component from the deployed ConfigMaps and Secrets
- Status of the services deployed in the OpenStack control plane
- Deployed CRs and CRDs, CSVs, pkgmanifests, subscriptions, installplans, operatorgroup, Pods, Deployments, StatefulSets, ReplicaSets, Services, Routes, ConfigMaps and (the relevant parts of) Secrets
- Network related info (MetalLB info, IPAddressPool, L2Advertisements, NetConfig, IPSet)
- SOS reports for OpenShift nodes that are running OpenStack service pods
Some openstack-must-gather collectors can be configured via environment
variables to behave differently. For example, SOS gathering can be disabled by
passing an empty SOS_SERVICES environment variable.
To provide environment variables, we need to invoke the gathering command manually, like this:
oc adm must-gather --image=quay.io/openstack-k8s-operators/openstack-must-gather -- SOS_SERVICES= gather
This is the list of available environment variables:
- OSP_NS: namespace where the OSP services are running. Defaults to openstack.
- OSP_OPERATORS_NS: namespace where the OSP operators are running. Defaults to openstack-operators.
- CONCURRENCY: must-gather runs many operations, so to speed things up they are run in parallel with a concurrency of 5 by default. Users can change this environment variable to adjust it to their needs.
- SOS_SERVICES: comma-separated list of services to gather SOS reports from. An empty string skips SOS report gathering. E.g. cinder,glance. Defaults to all of them.
- SOS_ONLY_PLUGINS: list of SOS report plugins to use. An empty string runs them all. Defaults to: block,cifs,crio,devicemapper,devices,iscsi,lvm2,memory,multipath,nfs,nis,nvme,podman,process,processor,selinux,scsi,udev.
- SOS_EDPM: comma-separated list of EDPM nodes to gather SOS reports from. An empty string skips SOS report gathering. Accepts the keyword all to gather all nodes. E.g. edpm-compute-0,edpm-compute-1.
- SOS_EDPM_PROFILES: list of SOS report profiles to use. An empty string runs them all. Defaults to: container,openstack_edpm,system,storage,virt.
- SOS_EDPM_PLUGINS: list of SOS report plugins to use. This is optional.
- OPENSTACK_DATABASES: comma-separated list of OpenStack databases that should be dumped. It is possible to set it to ALL to dump all databases. By default this variable is unset, so the database dump is skipped.
- ADDITIONAL_NAMESPACES: comma-separated list of additional namespaces where we want to gather the associated resources.
- DO_NOT_MASK: an option for CI purposes only. It's set to 0 by default (preserving the default behavior required in a production environment). However, if set to 1, it dumps secrets and services config files without masking sensitive data.
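For instance, a sketch combining several of the variables above (the values are examples):
oc adm must-gather --image=quay.io/openstack-k8s-operators/openstack-must-gather -- SOS_SERVICES=cinder,glance OPENSTACK_DATABASES=ALL gather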
openstack-must-gather retrieves both the Kubernetes resources defined in the
collection-scripts and the sos-reports associated with the CoreOS nodes and
the EDPM ones.
When the openstack-must-gather
execution ends, a directory containing all the
gathered resources is generated, and in general it contains:
- Global resources: useful to get some context about the status of the OpenShift cluster and the deployed OpenStack resources. These resources include crds, apiservices, csvs, packagemanifests, webhooks and network related information like nncp, nnce, IPAddressPool, and so forth
- Namespaced resources: critical to get the status of the OpenStack cluster and troubleshoot any problematic situation
- sos-reports: gathered from both the CoreOS nodes that are part of the OpenShift cluster and the EDPM nodes, if they are part of the cluster; the information to connect to the EDPM nodes is retrieved from the OpenStackDataplaneNodeSets CR, and the resulting sos-report is retrieved from the remote nodes and downloaded into the current must-gather directory
- OpenStack Ctlplane Services: commands run through the openstack-cli to check the relevant resources generated within the OpenStack cluster (endpoint list, networks, subnets, registered services, etc.)
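The gathered command output roughly matches what you would get by running the same openstack-cli commands by hand, for example (assuming the usual openstackclient pod in the openstack namespace):
oc -n openstack rsh openstackclient openstack endpoint list
oc -n openstack rsh openstackclient openstack network list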
A generic output of the openstack-must-gather
execution looks like the
following:
+-----------------------------------+
| . | +-----------------------------+
| ├── apiservices | | ctlplane/neutron/ |
| ├── crd | | ├── agent_list |
| ├── csv | (control plane resources) | ├── extension_list |
| ├── ctlplane |------------------------------------| ├── floating_ip_list |
| │ ├── neutron | | ├── network_list |
| │ ├── nova |----------------- | ├── port_list |
| │ └── placement | | | ├── router_list |
| ├── dbs | +---------------------------+ | ├── security_group_list |
| ├── namespaces | | namespaces/openstack/ | | └── subnet_list |
| │ ├── cert-manager | | ├── all_resources.log | +-----------------------------+
| │ ├── openshift-machine-api | | ├── buildconfig |-----------------------------------
| │ ├── openshift-nmstate | | ├── configmaps | |
| │ ├── openstack | | ├── cronjobs | +--------------------------------------------------------------------+
| │ └── openstack-operators | | ├── crs | | namespaces/openstack/secrets/glance/ |
| ├── network | | ├── daemonset | | ├── cert-glance-default-public-route.yaml |
| │ ├── ipaddresspools | | ├── deployments | | ├── glance-config-data.yaml |
| │ ├── nnce | | ├── events.log | | ├── glance-config-data.yaml-00-config.conf |
| │ └── nncp | | ├── installplans | | ├── glance-default-single-config-data.yaml |
| ├── nodes | | ├── jobs | | ├── glance-default-single-config-data.yaml-00-config.conf |
| ├── sos-reports | | ├── nad.log | | ├── glance-default-single-config-data.yaml-10-glance-httpd.conf |
| │ ├── _all_nodes | | ├── pods | | ├── glance-default-single-config-data.yaml-httpd.conf |
| │ ├── barbican | | ├── pvc.log | | ├── glance-default-single-config-data.yaml-ssl.conf |
| │ ├── ceilometer | | ├── replicaset | | └── glance-scripts.yaml |
| │ ├── glance | | ├── routes | +--------------------------------------------------------------------+
| │ ├── keystone | | ├── secrets | |
| │ ├── neutron | | ├── services | +--------------------------------------------------------------------+
| │ ├── nova | | ├── statefulsets | | Note: if DO_NOT_MASK is passed in CI, secrets are dumped without |
| │ ├── ovn | | └── subscriptions | | hiding any sensitive information. |
| │ ├── ovs | +---------------------------+ +--------------------------------------------------------------------+
| │ ├── placement |
| │ └── swift |
| └── webhooks |
| ├── mutating |
| └── validating |
+-----------------------------------+
In a troubleshooting session, however, it's critical to check and analyze not
only Secrets and services config files, but also the CRs associated with each
service and the Pod logs.
These are still namespaced resources, and they can be found in the crs and
pods directories.
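For example, from inside the generated dump directory, a quick scan of all the gathered Pod logs for errors could look like this (a minimal sketch):
grep -ri error namespaces/openstack/pods/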
Other than that, for each namespace some generic information is collected. In
particular, the openstack-must-gather tool is able to retrieve:
- Events recorded for the current namespace
- Network Attachment Definitions
- PVCs attached to the deployed Pods
- A picture of the namespace in terms of deployed resources (all_resources.log)
+---------------------------+
| namespaces/openstack/ | ------------------------------------
| ├── buildconfig | |
| ├── cronjobs | +--------------------------------------------------------+
| ├── crs | | namespaces/openstack/crs/ |
| ├── daemonset | | ├── barbicanapis.barbican.openstack.org |
| ├── deployments | | ├── barbicankeystonelisteners.barbican.openstack.org |
| ├── events.log | | ├── barbicans.barbican.openstack.org |
| ├── installplans | | ├── barbicanworkers.barbican.openstack.org |
| ├── jobs | | ... |
| ├── nad.log | | ... |
| ├── pods | | ├── glanceapis.glance.openstack.org |
| ├── all_resources.log | | └── glance-default-single.yaml |
| ├── configmaps | | ├── glances.glance.openstack.org |
| ├── pvc.log | | └── glance.yaml |
| ├── replicaset | | ├── keystoneapis.keystone.openstack.org |
| ├── routes | | ├── keystoneendpoints.keystone.openstack.org |
| ├── secrets | | ├── keystoneservices.keystone.openstack.org |
| ├── services | | ... |
| ├── statefulsets | | ├── telemetries.telemetry.openstack.org |
| └── subscriptions | | └── transporturls.rabbitmq.openstack.org |
+---------------------------+ +--------------------------------------------------------+
As depicted in the schema above, the same pattern applies to the Pod resources.
For each Pod, the openstack-must-gather tool is able to retrieve the
description and the associated logs (including -previous in case the Pod is in
a CrashLoopBackOff status).
+---------------------------+
| namespaces/openstack/ | ------------------------------------
| ├── buildconfig | |
| ├── cronjobs | +-----------------------------------------------------------+
| ├── crs | | namespaces/openstack/pods/glance-dbpurge-28500481-f4jk9 |
| ├── daemonset | | ├── glance-dbpurge-28500481-f4jk9-describe |
| ├── deployments | | └── logs |
| ├── events.log | | └── glance-dbpurge.log |
| ├── installplans | | namespaces/openstack/pods/glance-default-single-0 |
| ├── jobs | | ├── glance-default-single-0-describe |
| ├── nad.log | | └── logs |
| ├── pods | | ├── glance-api.log |
| ├── all_resources.log | | ├── glance-httpd.log |
| ├── configmaps | | └── glance-log.log |
| ├── pvc.log | | namespaces/openstack/pods/glance-default-single-1 |
| ├── replicaset | | ├── glance-default-single-1-describe |
| ├── routes | | └── logs |
| ├── secrets | | ├── glance-api.log |
| ├── services | | ├── glance-httpd.log |
| ├── statefulsets | | └── glance-log.log |
| └── subscriptions | | namespaces/openstack/pods/glance-default-single-2 |
+---------------------------+ | ├── glance-default-single-2-describe |
| └── logs |
| ├── glance-api.log |
| ├── glance-httpd.log |
| └── glance-log.log |
+-----------------------------------------------------------+
You can build the image locally using the included Dockerfile.
A Makefile is also provided. To use it, you must pass:
- an image name using the variable MUST_GATHER_IMAGE
- an image registry using the variable IMAGE_REGISTRY (default is quay.io/openstack-k8s-operators)
- an image tag using the variable IMAGE_TAG (default is latest)
The targets for make are as follows:
- check-image: check that the MUST_GATHER_IMAGE variable is set
- build: build the image with the supplied name and push it
- check: run sanity checks against the script collection
- pytest: run sanity checks and unit tests against the Python script collection
- podman-build: build the must-gather image
- podman-push: push an already-built must-gather image
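For instance, to build the image locally with podman (the image name is your choice):
MUST_GATHER_IMAGE=openstack-must-gather make podman-build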
One possible workflow that can be used for development is to run the openstack must-gather in debug mode:
$ oc adm must-gather --image=quay.io/openstack-k8s-operators/openstack-must-gather -- gather_debug
[must-gather ] OUT Using must-gather plug-in image: quay.io/openstack-k8s-operators/openstack-must-gather:latest
When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information:
ClusterID: 6ffe21e8-7dc9-4719-926f-f34ec33e6916
ClusterVersion: Stable at "4.12.35"
ClusterOperators:
All healthy and stable
[must-gather ] OUT namespace/openshift-must-gather-b9fpw created
[must-gather ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-mq6st created
[must-gather ] OUT pod for plug-in image quay.io/openstack-k8s-operators/openstack-must-gather:latest created
[must-gather-mxc7k] POD 2023-09-29T09:56:24.995284210Z Must gather entering debug mode, will sleep until file /tmp/rm-to-finish-gathering is deleted
[must-gather-mxc7k] POD 2023-09-29T09:56:24.995284210Z
Running in debug mode makes the must-gather container just sit waiting for a
file to be removed, allowing us to go into the container and test our scripts.
In the above case, where the namespace is openshift-must-gather-b9fpw and the
pod name is must-gather-mxc7k, we would enter the container with:
oc -n openshift-must-gather-b9fpw rsh must-gather-mxc7k
And then, if we were debugging a bash script called gather_trigger_gmr, we
would run it in debug mode:
sh-5.1# bash -x /usr/bin/gather_trigger_gmr
And once we had the script working as intended, we would copy the file from a terminal:
oc cp openshift-must-gather-b9fpw/must-gather-mxc7k:usr/bin/gather_trigger_gmr collection-scripts/gather_trigger_gmr
And finally, from inside the container shell we let the oc adm must-gather
command complete, optionally running everything to ensure we haven't
inadvertently broken anything in the process:
sh-5.1# gather
sh-5.1# rm /tmp/rm-to-finish-gathering
Besides the generic OpenShift etcd objects that the must-gather scripts are currently gathering, there's also component-specific code that gathers other information, so when adding or improving a component we should look into:
- collection-scripts/common.sh: services are listed in the OSP_SERVICES variable to indicate that the scripts must gather their config maps, secrets, additional information, trigger Guru Meditation Reports, etc. (see the sketch after this list)
- collection-scripts/gather_services_status: runs OpenStack commands to gather additional information. For example, for Cinder it gathers the state of the services, volume types, QoS specs, volume transfer requests, a summary of existing volumes, pool information, etc.
- collection-scripts/gather_trigger_gmr: triggers Guru Meditation Reports on the services so they are present in the logs when those are gathered afterwards.
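As a purely illustrative sketch of how such a per-service collector can key off OSP_SERVICES (hypothetical code, not the actual collection-scripts; OUT_DIR is a placeholder for the collection directory):

# hypothetical: iterate over the enabled services and gather extra data
# for the ones this script knows about
for service in ${OSP_SERVICES//,/ }; do
  case "${service}" in
    cinder)
      # hypothetical: dump Cinder service state into the collection directory
      openstack volume service list > "${OUT_DIR}/ctlplane/cinder/service_list"
      ;;
  esac
done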
(optional) Build and push the must-gather image to a registry:
git clone ssh://[email protected]/openstack-k8s-operators/openstack-must-gather.git
cd openstack-must-gather
IMAGE_TAG=<tag> IMAGE_REGISTRY=<registry> MUST_GATHER_IMAGE=openstack-must-gather make build
On a machine where you have oc adm
access, do the following:
oc adm must-gather --image=<registry>/openstack-must-gather:<tag>
When generation is finished, you will find the dump in the current directory
under must-gather.local.XXXXXXXXXX
.
It is possible to gather debugging information about specific features by adding
to the oc adm must-gather
command the --image-stream
argument.
The must-gather tool supports multiple images, so you can gather data about more
than one feature by running a single command.
Adding --image-stream
assumes that an image has been imported in an OpenShift
namespace and the associated imagestream
is available.
To import an image as an imagestream, run the following commands:
oc project <namespace>
oc import-image <registry>/<image>:<tag> --confirm
and double check the imagestream
exists in the specified namespace.
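For example:
oc get imagestream -n <namespace>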
At this point, run the oc adm must-gather
command passing one or more
imagestream parameters.
For instance, assuming we import the kubevirt must-gather image within the
existing openstack namespace:
oc project openstack
oc import-image kubevirt-must-gather --from=quay.io/kubevirt/must-gather:latest --confirm
we can combine, in a single command, three different container executions that gather different aspects of the same environment:
oc adm must-gather --image-stream=openstack/kubevirt-must-gather \
--image-stream=openshift/must-gather \
--image=quay.io/openstack-k8s-operators/openstack-must-gather:latest
The command above will create three pods associated with the existing
imagestream objects and the image that point to the openstack, openshift and
kubevirt must-gather container images.
[must-gather] OUT pod for plug-in image quay.io/openstack-k8s-operators/openstack-must-gather:latest created
[must-gather] OUT pod for plug-in image quay.io/kubevirt/must-gather@sha256:501e30ac7d5b9840a918bb9e5aa830686288ccfeee37d70aaf99cd2e302a2bb0 created
[must-gather] OUT pod for plug-in image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e9601b492cbb375f0a05310efa6025691f8bba6a97667976cd4baf4adf0f244c created
...
...
To gather config files and logs from hosts we need to rely on the sos tool
(formerly sosreport).
We can obtain logs from both OCP cluster nodes and compute nodes.
First we need to gather the cluster host names:
$ oc get nodes -o name
node/master-0
node/master-1
node/master-2
Then we need to log in to a node via a debug container:
$ oc debug node/master-0
Warning: would violate PodSecurity "restricted:v1.24": host namespaces (hostNetwork=true, hostPID=true, hostIPC=true), privileged (container "container-00" must not set securityContext.privileged=true), allowPrivilegeEscalation != false (container "container-00" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "container-00" must set securityContext.capabilities.drop=["ALL"]), restricted volume types (volume "host" uses restricted volume type "hostPath"), runAsNonRoot != true (pod or container "container-00" must set securityContext.runAsNonRoot=true), runAsUser=0 (container "container-00" must not set runAsUser=0), seccompProfile (pod or container "container-00" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
Starting pod/master-0-debug-lpvnn ...
To use host binaries, run `chroot /host`
Pod IP: 192.168.111.20
If you don't see a command prompt, try pressing enter.
sh-4.4#
We need to chroot into the node itself in order to have access to all commands
sh-4.4# chroot /host
sh-5.1#
We need to spawn a toolbox container to use the sos tool.
sh-5.1# toolbox
Trying to pull registry.redhat.io/rhel9/support-tools:latest...
Getting image source signatures
Checking if image destination supports signatures
Copying blob bf237f774da8 done
Copying blob 4b36affe1d29 done
Copying config d184fda91f done
Writing manifest to image destination
Storing signatures
d184fda91f0aa5c6deed433f984bd393754f707a285445a83637aaa13b8b7e86
Spawning a container 'toolbox-root' with image 'registry.redhat.io/rhel9/support-tools'
Detected RUN label in the container image. Using that as the default...
57305547f44861347e453b535fd59a56b4d9c0a3f472b7f1d86c2f246c94a5ea
toolbox-root
Container started successfully. To exit, type 'exit'.
[root@master-0 /]#
Finally we can collect all required config/logs
[root@master-0 /]# sos report -k crio.all=on -k crio.logs=on
<snip output>
Finished running plugins
Creating compressed archive...
Your sosreport has been generated and saved in:
/host/var/tmp/sosreport-master-0-2023-11-15-mvgdmxo.tar.xz
Size 82.68MiB
Owner root
sha256 81b356f0069b4bc35dc4ae016e2c25369c9cefa214c75973cde8b8470ffa4516
Please send this file to your support representative.
[root@master-0 /]#
You can also add the flag --all-logs to the sos command to retrieve further
configuration files and logs.
For more information, see the official OCP documentation here.
First we need to gather compute node info from the OpenStackDataPlaneNodeSet
resources:
$ oc get openstackdataplanenodesets -o name
openstackdataplanenodeset.dataplane.openstack.org/openstack-edpm-ipam
$ oc get openstackdataplanenodeset.dataplane.openstack.org/openstack-edpm-ipam -o json | jq -r '.spec.nodes[] | [.hostName, .ansible.ansibleHost] | @csv'
"edpm-compute-0","192.168.122.100"
"edpm-compute-1","192.168.122.101"
"edpm-compute-2","192.168.122.102"
Then you have to log in to the nodes via SSH. In the OpenStackDataPlaneNodeSet
resource you can find the user used by Ansible and the private SSH key, which we can extract from the secret.
# user
$ oc get openstackdataplanenodeset.dataplane.openstack.org/openstack-edpm-ipam -o json | jq .spec.nodeTemplate.ansible.ansibleUser
"zuul"
# private ssh key
$ oc get openstackdataplanenodeset.dataplane.openstack.org/openstack-edpm-ipam -o json |jq .spec.nodeTemplate.ansibleSSHPrivateKeySecret
"dataplane-ansible-ssh-private-key-secret"
# save private ssh key to a file and fix its permissions
$ oc get secret/dataplane-ansible-ssh-private-key-secret -o go-template='{{ index .data "ssh-privatekey" | base64decode }}' > ~/.ssh/compute.key
$ chmod 0600 ~/.ssh/compute.key
Finally, log in to the compute node:
$ ssh -i ~/.ssh/compute.key [email protected]
Register this system with Red Hat Insights: insights-client --register
Create an account or view all your systems at https://red.ht/insights-dashboard
Last login: Wed Nov 15 10:02:11 2023 from 192.168.111.1
[zuul@compute-0 ~]$
And launch the sos
command just like we did in the other example, but with different flags.
sudo sos report -p system,storage,virt,openstack_edpm
As you can see in the example above, there's a new profile for the sos tool,
openstack_edpm, which enables gathering from the plugins running on the node.
The output file /var/tmp/sosreport-$hostname-$date-$hash.tar.xz will contain
all the required files from the paths introduced by the new containerized
services, like /var/lib/openstack/, /var/lib/edpm-config, /var/log/containers/
and so on.
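To quickly inspect the archive contents without extracting it (using the same placeholder file name):
tar -tJf /var/tmp/sosreport-$hostname-$date-$hash.tar.xz | head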
If the image is pushed to the quay.io registry, make sure it's set to public,
otherwise it can't be consumed by the must-gather tool.