Commit

Add troubleshooting documentation for incompatible Kubernetes and container runtime issues (#452)

jvoravong authored May 24, 2022
1 parent 99fa46a commit 1f88094
Showing 2 changed files with 267 additions and 0 deletions.
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -4,6 +4,10 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).

## Unreleased

### Added

- Add troubleshooting documentation for incompatible Kubernetes and container runtime issues (#452)

### Fixed

- Fix native OTel logs collection where 0 length logs cause errors after the 0.29.0 opentelemetry-logs-library changes in 0.49.0 (#TBD)
263 changes: 263 additions & 0 deletions docs/troubleshooting.md
@@ -130,3 +130,266 @@ agent:
```

The same approach can be applied to any other failing exporter.

## Possible problems with Kubernetes and container runtimes

A Kubernetes cluster using a container runtime that is incompatible with its
Kubernetes version or configuration could experience these issues cluster-wide
(a few quick checks are sketched after this list):
- Stats from containers, pods, or nodes being absent or malformed. As a result,
  the Splunk OTel Collector, which requires these stats, will not produce the
  corresponding metrics.
- Containers, pods, and nodes failing to start successfully or to stop cleanly.
- The Kubelet process on a node being in a defunct state.
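
If you suspect this class of problem, a few quick checks can surface these
symptoms. This is a sketch rather than an exhaustive procedure; the kubelet
commands assume systemd-based nodes that you can reach over SSH.

```
# Look for nodes that are NotReady or report an unexpected container runtime.
kubectl get nodes -o wide
# Look for pods stuck in states such as ContainerCreating, CrashLoopBackOff, or Terminating.
kubectl get pods --all-namespaces | grep -Ev 'Running|Completed'
# Inspect a suspect node's conditions and recent events for kubelet or runtime errors.
kubectl describe node <node-name>
# On the node itself, check whether the kubelet is healthy.
systemctl status kubelet
journalctl -u kubelet --since "1 hour ago" | tail -n 50
```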

Kubernetes requires you to install a
[container runtime](https://kubernetes.io/docs/setup/production-environment/container-runtimes/)
on each node in the cluster so that pods can run there. Multiple container
runtimes such as containerd, CRI-O, Docker, and Mirantis (formerly Docker
Engine – Enterprise) are well supported. The compatibility of a specific
Kubernetes version with a specific container runtime can vary, so it is
recommended to use a combination of Kubernetes version and container runtime
that is documented to be compatible.
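
If you have SSH access to a node, you can also confirm the runtime version
directly on the node. This is a sketch; it assumes `crictl` is installed on
the node and that the runtime is containerd.

```
# Query the CRI endpoint for the runtime name and version.
crictl version
# Or ask the runtime binary itself, for example containerd:
containerd --version
```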

### Troubleshooting Kubernetes and container runtime incompatibility

- Find out which Kubernetes version and container runtime are in use.
- In the example below, node-1 uses Kubernetes 1.19.6 and containerd 1.4.1.
```
kubectl get nodes -o wide
NAME STATUS VERSION CONTAINER-RUNTIME
node-1 Ready v1.19.6 containerd://1.4.1
```
- Verify that you are using a container runtime that is documented to work
with your Kubernetes version. Container runtime maintainers document
compatibility in their respective projects; the documentation for the
container runtimes mentioned above is linked below.
- [containerd](https://containerd.io/releases/#kubernetes-support)
- [CRI-O](https://github.com/cri-o/cri-o#compatibility-matrix-cri-o--kubernetes)
- [Mirantis](https://docs.mirantis.com/container-cloud/latest/compat-matrix.html)
- Use the Kubelet "summary" API to verify container, pod, and node stats.
- In this section, verify that the CPU, memory, and network stats used by the
collector to generate the
[Kubelet Stats Receiver metrics](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/kubeletstatsreceiver/documentation.md#metrics)
are present. You can extend these techniques to evaluate other Kubernetes
stats that are available; a combined scan across all pods is sketched after
the expandable examples below. All the stats in the commands and sample
outputs below should be present unless otherwise noted. If your output is
missing stats, or your stat values appear in a different format, your
Kubernetes cluster and container runtime might not be fully compatible.
<details>
<summary>1) Verify a node's stats</summary>

```
# Get the names of the nodes in your cluster.
kubectl get nodes -o wide
# Pick a node to evaluate and set its name to an environment variable.
NODE_NAME=node-1
# Verify the node has proper stats with this command and sample output.
kubectl get --raw "/api/v1/nodes/"${NODE_NAME}"/proxy/stats/summary" | jq '{"node": {"name": .node.nodeName, "cpu": .node.cpu, "memory": .node.memory, "network": .node.network}} | del(.node.network.interfaces)'
{
"node": {
"name": "node-1",
"cpu": {
"time": "2022-05-20T18:12:08Z",
"usageNanoCores": 149771849,
"usageCoreNanoSeconds": 2962750554249399
},
"memory": {
"time": "2022-05-20T18:12:08Z",
"availableBytes": 2701385728, # Could be absent if node memory allocations were missing.
"usageBytes": 3686178816,
"workingSetBytes": 1421492224,
"rssBytes": 634343424,
"pageFaults": 18632526,
"majorPageFaults": 726
},
"network": {
"time": "2022-05-20T18:12:08Z",
"name": "eth0",
"rxBytes": 105517219156,
"rxErrors": 0,
"txBytes": 98151853779,
"txErrors": 0
}
}
}

# For reference, here is the mapping for the node stat names to the Splunk Otel Collector metric names.
# cpu.usageNanoCores -> k8s.node.cpu.utilization
# cpu.usageCoreNanoSeconds -> k8s.node.cpu.time
# memory.availableBytes -> k8s.node.memory.available
# memory.usageBytes -> k8s.node.filesystem.usage
# memory.workingSetBytes -> k8s.node.memory.working_set
# memory.rssBytes -> k8s.node.memory.rss
# memory.pageFaults -> k8s.node.memory.page_faults
# memory.majorPageFaults -> k8s.node.memory.major_page_faults
# network.rxBytes -> k8s.node.network.io{direction="receive"}
# network.rxErrors -> k8s.node.network.errors{direction="receive"}
# network.txBytes -> k8s.node.network.io{direction="transmit"}
# network.txErrors -> k8s.node.network.errors{direction="transmit"}
```
</details>

<details>
<summary>2) Verify a pod's stats</summary>

```
# Get the names of the pods in your node.
kubectl get --raw "/api/v1/nodes/"${NODE_NAME}"/proxy/stats/summary" | jq '.pods[].podRef.name'
# Pick a pod to evaluate and set its name to an environment variable.
POD_NAME=splunk-otel-collector-agent-6llkr
# Verify the pod has proper stats with this command and sample output.
kubectl get --raw "/api/v1/nodes/"${NODE_NAME}"/proxy/stats/summary" | jq '.pods[] | select(.podRef.name=='\"$POD_NAME\"') | {"pod": {"name": .podRef.name, "cpu": .cpu, "memory": .memory, "network": .network}} | del(.pod.network.interfaces)'
{
"pod": {
"name": "splunk-otel-collector-agent-6llkr",
"cpu": {
"time": "2022-05-20T18:38:47Z",
"usageNanoCores": 10774467,
"usageCoreNanoSeconds": 1709095026234
},
"memory": {
"time": "2022-05-20T18:38:47Z",
"availableBytes": 781959168, # Could be absent if pod memory limits were missing.
"usageBytes": 267563008,
"workingSetBytes": 266616832,
"rssBytes": 257036288,
"pageFaults": 0,
"majorPageFaults": 0
},
"network": {
"time": "2022-05-20T18:38:55Z",
"name": "eth0",
"rxBytes": 105523812442,
"rxErrors": 0,
"txBytes": 98159696431,
"txErrors": 0
}
}
}

# For reference, here is the mapping for the pod stat names to the Splunk Otel Collector metric names.
# Some of these metrics have a current and a legacy name; current names are listed first.
# pod.cpu.usageNanoCores -> k8s.pod.cpu.utilization
# pod.cpu.usageCoreNanoSeconds -> k8s.pod.cpu.time
# pod.memory.availableBytes -> k8s.pod.memory.available
# pod.memory.usageBytes -> k8s.pod.filesystem.usage
# pod.memory.workingSetBytes -> k8s.pod.memory.working_set
# pod.memory.rssBytes -> k8s.pod.memory.rss
# pod.memory.pageFaults -> k8s.pod.memory.page_faults
# pod.memory.majorPageFaults -> k8s.pod.memory.major_page_faults
# pod.network.rxBytes -> k8s.pod.network.io{direction="receive"} or pod_network_receive_bytes_total
# pod.network.rxErrors -> k8s.pod.network.errors{direction="receive"} or pod_network_receive_errors_total
# pod.network.txBytes -> k8s.pod.network.io{direction="transmit"} or pod_network_transmit_bytes_total
# pod.network.txErrors -> k8s.pod.network.errors{direction="transmit"} or pod_network_transmit_errors_total
```

</details>

<details>
<summary>3) Verify a container's stats</summary>

```
# Get the names of the containers in your pod.
kubectl get --raw "/api/v1/nodes/"${NODE_NAME}"/proxy/stats/summary" | jq '.pods[] | select(.podRef.name=='\"$POD_NAME\"') | .containers[].name'
# Pick a container to evaluate and set its name to an environment variable.
CONTAINER_NAME=otel-collector
# Verify the container has proper stats with this command and sample output.
kubectl get --raw "/api/v1/nodes/"${NODE_NAME}"/proxy/stats/summary" | jq '.pods[] | select(.podRef.name=='\"$POD_NAME\"') | .containers[] | select(.name=='\"$CONTAINER_NAME\"') | {"container": {"name": .name, "cpu": .cpu, "memory": .memory}}'
{
"container": {
"name": "otel-collector",
"cpu": {
"time": "2022-05-20T18:42:15Z",
"usageNanoCores": 6781417,
"usageCoreNanoSeconds": 1087899649154
},
"memory": {
"time": "2022-05-20T18:42:15Z",
"availableBytes": 389480448, # Could be absent if container memory limits were missing.
"usageBytes": 135753728,
"workingSetBytes": 134807552,
"rssBytes": 132923392,
"pageFaults": 93390,
"majorPageFaults": 0
}
}
}

# For reference, here is the mapping for the container stat names to the Splunk Otel Collector metric names.
# container.cpu.usageNanoCores -> container.cpu.utilization
# container.cpu.usageCoreNanoSeconds -> container.cpu.time
# container.memory.availableBytes -> container.memory.available
# container.memory.usageBytes -> container.memory.usage
# container.memory.workingSetBytes -> container.memory.working_set
# container.memory.rssBytes -> container.memory.rss
# container.memory.pageFaults -> container.memory.page_faults
# container.memory.majorPageFaults -> container.memory.major_page_faults
```
</details>
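
To scan every pod on a node at once for missing memory or network stats,
rather than checking one pod at a time as above, you can filter the same
summary API output. This is an illustrative sketch based on the fields shown
in the sample outputs above.

```
# Print the names of pods on the node that are missing working set memory or network receive stats.
kubectl get --raw "/api/v1/nodes/"${NODE_NAME}"/proxy/stats/summary" | jq -r '.pods[] | select((.memory.workingSetBytes == null) or (.network.rxBytes == null)) | .podRef.name'
```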

### Reported incompatible Kubernetes and container runtime issues

- Note: Managed Kubernetes services might use a modified container runtime;
  the service provider may have applied custom patches or bug fixes that
  aren't present in the unmodified upstream runtime. A quick per-node version
  check is sketched after this list.
- Kubernetes 1.21.0-1.21.11 using containerd - Memory and network stats/metrics
can be missing
<details>
<summary>Expand for more details</summary>

- Affected metrics:
- k8s.pod.network.io{direction="receive"} or
pod_network_receive_bytes_total
- k8s.pod.network.errors{direction="receive"} or
pod_network_receive_errors_total
- k8s.pod.network.io{direction="transmit"} or
pod_network_transmit_bytes_total
- k8s.pod.network.errors{direction="transmit"} or
pod_network_transmit_errors_total
- container.memory.available
- container.memory.usage
- container.memory.rss
- container.memory.page_faults
- container.memory.major_page_faults
- Resolutions:
- Upgrading Kubernetes to at least 1.21.12 fixed all the missing metrics.
- Upgrading containerd to a newer version of 1.4.x or 1.5.x is still
recommended.
</details>
- Kubernetes 1.22.0-1.22.8 using containerd 1.4.0-1.4.12 - Memory and network
stats/metrics can be missing
<details>
<summary>Expand for more details</summary>

- Affected metrics:
- k8s.pod.network.io{direction="receive"} or
pod_network_receive_bytes_total
- k8s.pod.network.errors{direction="receive"} or
pod_network_receive_errors_total
- k8s.pod.network.io{direction="transmit"} or
pod_network_transmit_bytes_total
- k8s.pod.network.errors{direction="transmit"} or
pod_network_transmit_errors_total
- k8s.pod.memory.available
- container.memory.available
- container.memory.usage
- container.memory.rss
- container.memory.page_faults
- container.memory.major_page_faults
- Resolutions:
- Upgrading Kubernetes to at least 1.22.9 fixed the missing container
memory and pod network metrics.
- Upgrading containerd to at least 1.4.13 or 1.5.0 fixed the missing pod
memory metrics.
</details>
- Kubernetes 1.23.0-1.23.6 using containerd - Memory stats/metrics can be
missing
<details>
<summary>Expand for more details</summary>

- Affected metrics:
- k8s.pod.memory.available
- Resolutions:
- No resolutions have been documented as of 2022-05-2.
</details>
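
To check whether your nodes fall into one of the version ranges reported
above, print each node's kubelet and container runtime versions side by side.
This is a minimal sketch that uses standard node status fields.

```
# Print the kubelet version and container runtime version for every node.
kubectl get nodes -o custom-columns='NAME:.metadata.name,KUBELET:.status.nodeInfo.kubeletVersion,RUNTIME:.status.nodeInfo.containerRuntimeVersion'
# Compare the reported versions against the ranges listed above.
```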
