diff --git a/CHANGELOG.md b/CHANGELOG.md
index dcf541fe42..35790f1192 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -4,6 +4,10 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
## Unreleased
+### Added
+
+- Add troubleshooting documentation for incompatible Kubernetes and container runtime issues (#452)
+
### Fixed
- Fix native OTel logs collection where 0 length logs cause errors after the 0.29.0 opentelemetry-logs-library changes in 0.49.0 (#TBD)
diff --git a/docs/troubleshooting.md b/docs/troubleshooting.md
index 75ad212d24..9a44a9493b 100644
--- a/docs/troubleshooting.md
+++ b/docs/troubleshooting.md
@@ -130,3 +130,266 @@ agent:
```
Similar can be applied to any other failing exporter.
+
+## Possible problems with Kubernetes and container runtimes
+
+A Kubernetes cluster using a container runtime that is incompatible with its
+Kubernetes version or configuration could experience these issues cluster-wide:
+- Stats from containers, pods, or nodes being absent or malformed. As a result,
+  the Splunk OTel Collector, which relies on these stats, will not produce the
+  corresponding metrics.
+- Containers, pods, and nodes failing to start successfully or stop cleanly.
+- The Kubelet process on a node being in a defunct state.
+
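+If you see the last two symptoms above, a quick node-level health check can
+confirm whether the Kubelet or the container runtime itself is unhealthy. This
+is a minimal sketch that assumes a systemd-based node with shell access and,
+optionally, the crictl CLI installed; adapt the service name to your
+distribution.
+
+```
+# On the affected node, check that the Kubelet service is running.
+systemctl status kubelet
+# Review recent Kubelet logs for container runtime related errors.
+journalctl -u kubelet --since "1 hour ago" --no-pager | tail -n 50
+# If crictl is installed, confirm the container runtime's CRI endpoint responds.
+crictl info
+```
+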
+Kubernetes requires you to install a
+[container runtime](https://kubernetes.io/docs/setup/production-environment/container-runtimes/)
+on each node in the cluster so that pods can run there. Multiple container
+runtimes such as containerd, CRI-O, Docker, and Mirantis (formerly Docker
+Engine – Enterprise) are well supported. Because the compatibility of a specific
+Kubernetes version and container runtime can vary, it is recommended to use a
+version of Kubernetes and a container runtime that have been documented to be
+compatible.
+
+### Troubleshooting Kubernetes and container runtime incompatibility
+
+- Find out which Kubernetes version and container runtime are being used.
+ - In the example below, node-1 uses Kubernetes 1.19.6 and containerd 1.4.1.
+ ```
+ kubectl get nodes -o wide
+ NAME STATUS VERSION CONTAINER-RUNTIME
+ node-1 Ready v1.19.6 containerd://1.4.1
+ ```
+- Verify that you are using a container runtime that has been documented to
+  work with your Kubernetes version. Container runtime creators document
+  compatibility in their respective projects; you can view the documentation for
+  the mentioned container runtimes with the links below, and a scripted
+  cross-check is sketched after them.
+ - [containerd](https://containerd.io/releases/#kubernetes-support)
+ - [CRI-O](https://github.com/cri-o/cri-o#compatibility-matrix-cri-o--kubernetes)
+ - [Mirantis](https://docs.mirantis.com/container-cloud/latest/compat-matrix.html)
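+
+  To cross-check every node at once, the Kubelet and container runtime versions
+  can also be read from the node status. This is a small sketch using kubectl's
+  jsonpath output; it only relies on standard node status fields.
+  ```
+  # Print each node's name, Kubelet (Kubernetes) version, and container runtime version.
+  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.kubeletVersion}{"\t"}{.status.nodeInfo.containerRuntimeVersion}{"\n"}{end}'
+  ```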
+- Use the Kubelet "summary" API to verify container, pod, and node stats.
+  - In this section we verify that the CPU, memory, and network stats the
+    collector uses to generate the
+    [Kubelet Stats Receiver metrics](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/kubeletstatsreceiver/documentation.md#metrics)
+    are present. You can expand these techniques to evaluate other available
+    Kubernetes stats. All the stats in the commands and sample outputs below
+    should be present unless otherwise noted. If your output is missing stats,
+    or your stat values appear to be in a different format, your Kubernetes
+    cluster and container runtime might not be fully compatible.
+
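+    Before checking individual fields, it can help to confirm that the summary
+    endpoint responds at all. This quick sketch assumes NODE_NAME is already set
+    to one of your node names, as shown in step 1 below; it should list the
+    top-level "node" and "pods" sections.
+    ```
+    # List the top-level sections returned by the Kubelet summary API.
+    kubectl get --raw "/api/v1/nodes/"${NODE_NAME}"/proxy/stats/summary" | jq 'keys'
+    ```
+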
+ 1) Verify a node's stats
+
+ ```
+ # Get the names of the nodes in your cluster.
+ kubectl get nodes -o wide
+ # Pick a node to evaluate and set its name to an environment variable.
+ NODE_NAME=node-1
+ # Verify the node has proper stats with this command and sample output.
+ kubectl get --raw "/api/v1/nodes/"${NODE_NAME}"/proxy/stats/summary" | jq '{"node": {"name": .node.nodeName, "cpu": .node.cpu, "memory": .node.memory, "network": .node.network}} | del(.node.network.interfaces)'
+ {
+ "node": {
+ "name": "node-1",
+ "cpu": {
+ "time": "2022-05-20T18:12:08Z",
+ "usageNanoCores": 149771849,
+ "usageCoreNanoSeconds": 2962750554249399
+ },
+ "memory": {
+ "time": "2022-05-20T18:12:08Z",
+ "availableBytes": 2701385728, # Could be absent if node memory allocations were missing.
+ "usageBytes": 3686178816,
+ "workingSetBytes": 1421492224,
+ "rssBytes": 634343424,
+ "pageFaults": 18632526,
+ "majorPageFaults": 726
+ },
+ "network": {
+ "time": "2022-05-20T18:12:08Z",
+ "name": "eth0",
+ "rxBytes": 105517219156,
+ "rxErrors": 0,
+ "txBytes": 98151853779,
+ "txErrors": 0
+ }
+ }
+ }
+
+   # For reference, here is the mapping for the node stat names to the Splunk OTel Collector metric names.
+ # cpu.usageNanoCores -> k8s.node.cpu.utilization
+ # cpu.usageCoreNanoSeconds -> k8s.node.cpu.time
+ # memory.availableBytes -> k8s.node.memory.available
+   # memory.usageBytes -> k8s.node.memory.usage
+ # memory.workingSetBytes -> k8s.node.memory.working_set
+ # memory.rssBytes -> k8s.node.memory.rss
+ # memory.pageFaults -> k8s.node.memory.page_faults
+ # memory.majorPageFaults -> k8s.node.memory.major_page_faults
+ # network.rxBytes -> k8s.node.network.io{direction="receive"}
+ # network.rxErrors -> k8s.node.network.errors{direction="receive"}
+ # network.txBytes -> k8s.node.network.io{direction="transmit"}
+   # network.txErrors -> k8s.node.network.errors{direction="transmit"}
+ ```
+
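+   If some of the fields above are missing, a scripted presence check can make
+   that obvious. This is a sketch that only tests a few key node stats with jq;
+   false in the output means the corresponding stat is absent.
+
+   ```
+   # Prints true/false for the presence of key node CPU, memory, and network stats.
+   kubectl get --raw "/api/v1/nodes/"${NODE_NAME}"/proxy/stats/summary" | jq '{cpu: ((.node.cpu // {}) | has("usageNanoCores")), memory: ((.node.memory // {}) | has("workingSetBytes")), network: ((.node.network // {}) | has("rxBytes"))}'
+   ```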
+
+ 2) Verify a pod's stats
+
+ ```
+   # Get the names of the pods on your node.
+ kubectl get --raw "/api/v1/nodes/"${NODE_NAME}"/proxy/stats/summary" | jq '.pods[].podRef.name'
+ # Pick a pod to evaluate and set its name to an environment variable.
+ POD_NAME=splunk-otel-collector-agent-6llkr
+ # Verify the pod has proper stats with this command and sample output.
+ kubectl get --raw "/api/v1/nodes/"${NODE_NAME}"/proxy/stats/summary" | jq '.pods[] | select(.podRef.name=='\"$POD_NAME\"') | {"pod": {"name": .podRef.name, "cpu": .cpu, "memory": .memory, "network": .network}} | del(.pod.network.interfaces)'
+ {
+ "pod": {
+ "name": "splunk-otel-collector-agent-6llkr",
+ "cpu": {
+ "time": "2022-05-20T18:38:47Z",
+ "usageNanoCores": 10774467,
+ "usageCoreNanoSeconds": 1709095026234
+ },
+ "memory": {
+ "time": "2022-05-20T18:38:47Z",
+ "availableBytes": 781959168, # Could be absent if pod memory limits were missing.
+ "usageBytes": 267563008,
+ "workingSetBytes": 266616832,
+ "rssBytes": 257036288,
+ "pageFaults": 0,
+ "majorPageFaults": 0
+ },
+ "network": {
+ "time": "2022-05-20T18:38:55Z",
+ "name": "eth0",
+ "rxBytes": 105523812442,
+ "rxErrors": 0,
+ "txBytes": 98159696431,
+ "txErrors": 0
+ }
+ }
+ }
+
+   # For reference, here is the mapping for the pod stat names to the Splunk OTel Collector metric names.
+   # Some of these metrics have a current and a legacy name; the current name is listed first.
+ # pod.cpu.usageNanoCores -> k8s.pod.cpu.utilization
+ # pod.cpu.usageCoreNanoSeconds -> k8s.pod.cpu.time
+ # pod.memory.availableBytes -> k8s.pod.memory.available
+   # pod.memory.usageBytes -> k8s.pod.memory.usage
+ # pod.memory.workingSetBytes -> k8s.pod.memory.working_set
+ # pod.memory.rssBytes -> k8s.pod.memory.rss
+ # pod.memory.pageFaults -> k8s.pod.memory.page_faults
+ # pod.memory.majorPageFaults -> k8s.pod.memory.major_page_faults
+ # pod.network.rxBytes -> k8s.pod.network.io{direction="receive"} or pod_network_receive_bytes_total
+ # pod.network.rxErrors -> k8s.pod.network.errors{direction="receive"} or pod_network_receive_errors_total
+ # pod.network.txBytes -> k8s.pod.network.io{direction="transmit"} or pod_network_transmit_bytes_total
+   # pod.network.txErrors -> k8s.pod.network.errors{direction="transmit"} or pod_network_transmit_errors_total
+ ```
+
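+   To spot this problem across every pod on the node at once, a sketch like the
+   following lists any pod whose network stats are missing entirely or whose
+   memory working set is absent; an empty list means nothing is missing.
+
+   ```
+   # List pods on this node that are missing network stats or memory workingSetBytes.
+   kubectl get --raw "/api/v1/nodes/"${NODE_NAME}"/proxy/stats/summary" | jq '[.pods[] | select(.network == null or ((.memory // {}) | has("workingSetBytes") | not)) | .podRef.name]'
+   ```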
+
+ 3) Verify a container's stats
+
+ ```
+ # Get the names of the containers in your pod.
+ kubectl get --raw "/api/v1/nodes/"${NODE_NAME}"/proxy/stats/summary" | jq '.pods[] | select(.podRef.name=='\"$POD_NAME\"') | .containers[].name'
+   # Pick a container to evaluate and set its name to an environment variable.
+ CONTAINER_NAME=otel-collector
+ # Verify the container has proper stats with this command and sample output.
+ kubectl get --raw "/api/v1/nodes/"${NODE_NAME}"/proxy/stats/summary" | jq '.pods[] | select(.podRef.name=='\"$POD_NAME\"') | .containers[] | select(.name=='\"$CONTAINER_NAME\"') | {"container": {"name": .name, "cpu": .cpu, "memory": .memory}}'
+ {
+ "container": {
+ "name": "otel-collector",
+ "cpu": {
+ "time": "2022-05-20T18:42:15Z",
+ "usageNanoCores": 6781417,
+ "usageCoreNanoSeconds": 1087899649154
+ },
+ "memory": {
+ "time": "2022-05-20T18:42:15Z",
+ "availableBytes": 389480448, # Could be absent if container memory limits were missing.
+ "usageBytes": 135753728,
+ "workingSetBytes": 134807552,
+ "rssBytes": 132923392,
+ "pageFaults": 93390,
+ "majorPageFaults": 0
+ }
+ }
+ }
+
+   # For reference, here is the mapping for the container stat names to the Splunk OTel Collector metric names.
+ # container.cpu.usageNanoCores -> container.cpu.utilization
+ # container.cpu.usageCoreNanoSeconds -> container.cpu.time
+ # container.memory.availableBytes -> container.memory.available
+ # container.memory.usageBytes -> container.memory.usage
+ # container.memory.workingSetBytes -> container.memory.working_set
+ # container.memory.rssBytes -> container.memory.rss
+ # container.memory.pageFaults -> container.memory.page_faults
+ # container.memory.majorPageFaults -> container.memory.major_page_faults
+ ```
+
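+   The same idea extends to every container on the node. This sketch lists
+   pod/container pairs whose memory usage stats are missing; an empty list
+   means all containers report them.
+
+   ```
+   # List {pod, container} pairs on this node that are missing memory usageBytes.
+   kubectl get --raw "/api/v1/nodes/"${NODE_NAME}"/proxy/stats/summary" | jq '[.pods[] | .podRef.name as $pod | (.containers // [])[] | select((.memory // {}) | has("usageBytes") | not) | {pod: $pod, container: .name}]'
+   ```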
+
+### Reported incompatible Kubernetes and container runtime issues
+
+- Note: Managed Kubernetes services might use a modified container runtime;
+  the service provider may have applied custom patches or bug fixes that aren't
+  present in an unmodified container runtime.
+- Kubernetes 1.21.0-1.21.11 using containerd - Memory and network stats/metrics
+ can be missing
+
+ - Affected metrics:
+ - k8s.pod.network.io{direction="receive"} or
+ pod_network_receive_bytes_total
+ - k8s.pod.network.errors{direction="receive"} or
+ pod_network_receive_errors_total
+ - k8s.pod.network.io{direction="transmit"} or
+ pod_network_transmit_bytes_total
+    - k8s.pod.network.errors{direction="transmit"} or
+ pod_network_transmit_errors_total
+ - container.memory.available
+ - container.memory.usage
+    - container.memory.rss
+ - container.memory.page_faults
+ - container.memory.major_page_faults
+ - Resolutions:
+ - Upgrading Kubernetes to at least 1.21.12 fixed all the missing metrics.
+ - Upgrading containerd to a newer version of 1.4.x or 1.5.x is still
+ recommended.
+
+- Kubernetes 1.22.0-1.22.8 using containerd 1.4.0-1.4.12 - Memory and network
+ stats/metrics can be missing
+
+ - Affected metrics:
+ - k8s.pod.network.io{direction="receive"} or
+ pod_network_receive_bytes_total
+ - k8s.pod.network.errors{direction="receive"} or
+ pod_network_receive_errors_total
+ - k8s.pod.network.io{direction="transmit"} or
+ pod_network_transmit_bytes_total
+    - k8s.pod.network.errors{direction="transmit"} or
+ pod_network_transmit_errors_total
+ - k8s.pod.memory.available
+ - container.memory.available
+ - container.memory.usage
+    - container.memory.rss
+ - container.memory.page_faults
+ - container.memory.major_page_faults
+ - Resolutions:
+ - Upgrading Kubernetes to at least 1.22.9 fixed the missing container
+ memory and pod network metrics.
+ - Upgrading containerd to at least 1.4.13 or 1.5.0 fixed the missing pod
+ memory metrics.
+
+- Kubernetes 1.23.0-1.23.6 using containerd - Memory stats/metrics can be
+ missing
+
+ - Affected metrics:
+ - k8s.pod.memory.available
+ - Resolutions:
+ - No resolutions have been documented as of 2022-05-2.
+