Add "Terminating" status in kube_pod_status_phase metrics #1013

Merged
2 changes: 1 addition & 1 deletion Makefile
@@ -49,7 +49,7 @@ doccheck: generate
@echo "- Checking if the generated documentation is up to date..."
@git diff --exit-code
@echo "- Checking if the documentation is in sync with the code..."
@grep -hoE '(kube_[^ |]+)' docs/* --exclude=README.md| sort -u > documented_metrics
@grep -hoE '(\| kube_[^ |]+)' docs/* --exclude=README.md| sed -E 's/\| //g' | sort -u > documented_metrics
@find internal/store -type f -not -name '*_test.go' -exec sed -nE 's/.*"(kube_[^"]+)"/\1/p' {} \; | sed -E 's/,//g' | sort -u > code_metrics
@diff -u0 code_metrics documented_metrics || (echo "ERROR: Metrics with - are present in code but missing in documentation, metrics with + are documented but not found in code."; exit 1)
@echo OK
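The updated `grep` pattern only accepts metric names that open a table cell (`| kube_...`), so names that merely appear inside a PromQL example in the docs are no longer treated as documented metrics. A minimal Go sketch of that difference follows; `documentedMetrics` and the sample `doc` string are hypothetical and only approximate the `grep | sed | sort -u` pipeline.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strings"
)

// documentedMetrics returns the sorted, de-duplicated matches of pattern in doc.
// It scans line by line (like grep) and strips a leading "| " (like the sed step).
func documentedMetrics(pattern, doc string) []string {
	re := regexp.MustCompile(pattern)
	seen := map[string]bool{}
	for _, line := range strings.Split(doc, "\n") {
		for _, m := range re.FindAllString(line, -1) {
			seen[strings.TrimPrefix(m, "| ")] = true
		}
	}
	names := make([]string, 0, len(seen))
	for n := range seen {
		names = append(names, n)
	}
	sort.Strings(names)
	return names
}

func main() {
	// Hypothetical excerpt: one table row and one inline query.
	doc := "| kube_pod_created | Gauge |\n" +
		"query: `count(kube_pod_status_phase{phase=\"Running\"}) by (namespace, pod)`"

	// Old pattern: also picks up the fragment embedded in the query.
	fmt.Println(documentedMetrics(`(kube_[^ |]+)`, doc))
	// New pattern: only the name that starts a table cell.
	fmt.Println(documentedMetrics(`(\| kube_[^ |]+)`, doc))
}
```

With the old pattern, the query fragment ends up in `documented_metrics` and would make the `diff` against `code_metrics` fail once the docs contain example queries.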
33 changes: 32 additions & 1 deletion docs/pod-metrics.md
@@ -21,7 +21,8 @@
| kube_pod_container_status_restarts_total | Counter | `container`=&lt;container-name&gt; <br> `namespace`=&lt;pod-namespace&gt; <br> `pod`=&lt;pod-name&gt; | STABLE |
| kube_pod_container_resource_requests | Gauge | `resource`=&lt;resource-name&gt; <br> `unit`=&lt;resource-unit&gt; <br> `container`=&lt;container-name&gt; <br> `pod`=&lt;pod-name&gt; <br> `namespace`=&lt;pod-namespace&gt; <br> `node`=&lt; node-name&gt; | STABLE |
| kube_pod_container_resource_limits | Gauge | `resource`=&lt;resource-name&gt; <br> `unit`=&lt;resource-unit&gt; <br> `container`=&lt;container-name&gt; <br> `pod`=&lt;pod-name&gt; <br> `namespace`=&lt;pod-namespace&gt; <br> `node`=&lt; node-name&gt; | STABLE |
| kube_pod_created | Gauge | `pod`=&lt;pod-name&gt; <br> `namespace`=&lt;pod-namespace&gt; |
| kube_pod_created | Gauge | `pod`=&lt;pod-name&gt; <br> `namespace`=&lt;pod-namespace&gt; | STABLE |
| kube_pod_deleted | Gauge | `pod`=&lt;pod-name&gt; <br> `namespace`=&lt;pod-namespace&gt; | EXPERIMENTAL |
| kube_pod_restart_policy | Gauge | `pod`=&lt;pod-name&gt; <br> `namespace`=&lt;pod-namespace&gt; <br> `type`=&lt;Always|Never|OnFailure&gt; | STABLE |
| kube_pod_init_container_info | Gauge | `container`=&lt;container-name&gt; <br> `pod`=&lt;pod-name&gt; <br> `namespace`=&lt;pod-namespace&gt; <br> `image`=&lt;image-name&gt; <br> `image_id`=&lt;image-id&gt; <br> `container_id`=&lt;containerid&gt; | STABLE |
| kube_pod_init_container_status_waiting | Gauge | `container`=&lt;container-name&gt; <br> `pod`=&lt;pod-name&gt; <br> `namespace`=&lt;pod-namespace&gt; | STABLE |
@@ -35,5 +36,35 @@
| kube_pod_init_container_resource_limits | Gauge | `resource`=&lt;resource-name&gt; <br> `unit`=&lt;resource-unit&gt; <br> `container`=&lt;container-name&gt; <br> `pod`=&lt;pod-name&gt; <br> `namespace`=&lt;pod-namespace&gt; <br> `node`=&lt; node-name&gt; | STABLE |
| kube_pod_spec_volumes_persistentvolumeclaims_info | Gauge | `pod`=&lt;pod-name&gt; <br> `namespace`=&lt;pod-namespace&gt; <br> `volume`=&lt;volume-name&gt; <br> `persistentvolumeclaim`=&lt;persistentvolumeclaim-claimname&gt; | STABLE |
| kube_pod_spec_volumes_persistentvolumeclaims_readonly | Gauge | `pod`=&lt;pod-name&gt; <br> `namespace`=&lt;pod-namespace&gt; <br> `volume`=&lt;volume-name&gt; <br> `persistentvolumeclaim`=&lt;persistentvolumeclaim-claimname&gt; | STABLE |
| kube_pod_status_reason | Gauge | `pod`=&lt;pod-name&gt; <br> `namespace`=&lt;pod-namespace&gt; <br> `reason`=&lt;NodeLost\|Evicted&gt; | EXPERIMENTAL |
| kube_pod_status_scheduled_time | Gauge | `pod`=&lt;pod-name&gt; <br> `namespace`=&lt;pod-namespace&gt; | STABLE |
| kube_pod_status_unschedulable | Gauge | `pod`=&lt;pod-name&gt; <br> `namespace`=&lt;pod-namespace&gt; | STABLE |

## Useful metrics queries

### How to retrieve non-standard Pod states

It is not straightforward to get Pod states such as "Terminating" and "Unknown", because they are not stored in a field of `Pod.Status`.

To get them, you need to combine multiple metrics (as the `kubectl` command-line code does).

For example:

* To get the list of pods that are in the `Unknown` state, you can run the following PromQL query: `count(kube_pod_status_phase{phase="Running"} == 1) by (namespace, pod) * count(kube_pod_status_reason{reason="NodeLost"} == 1) by (namespace, pod)`

* For Pods in the `Terminating` state: `count(kube_pod_status_phase{phase="Running"} == 1) by (namespace, pod) * count(kube_pod_deleted) by (namespace, pod) * count(kube_pod_status_reason{reason="NodeLost"} == 0) by (namespace, pod)`

Here is an example of a Prometheus rule that can be used to alert on a Pod that has been in the `Terminating` state for more than `5m`.

```yaml
groups:
- name: Pod state
  rules:
  - alert: PodsBlockInTerminatingState
    expr: count(kube_pod_status_phase{phase="Running"} == 1) by (namespace, pod) * count(kube_pod_deleted) by (namespace, pod) * count(kube_pod_status_reason{reason="NodeLost"} == 0) by (namespace, pod) > 0
    for: 5m
    labels:
      severity: page
    annotations:
      summary: Pod {{$labels.namespace}}/{{$labels.pod}} blocked in terminating state.
```
45 changes: 45 additions & 0 deletions internal/store/pod.go
@@ -37,6 +37,7 @@ var (
descPodLabelsDefaultLabels = []string{"namespace", "pod"}
containerWaitingReasons = []string{"ContainerCreating", "CrashLoopBackOff", "CreateContainerConfigError", "ErrImagePull", "ImagePullBackOff", "CreateContainerError", "InvalidImageName"}
containerTerminatedReasons = []string{"OOMKilled", "Completed", "Error", "ContainerCannotRun", "DeadlineExceeded", "Evicted"}
podStatusReasons = []string{"NodeLost", "Evicted"}

podMetricFamilies = []generator.FamilyGenerator{
{
@@ -197,6 +198,26 @@ var (
}
}),
},
		{
			Name: "kube_pod_deleted",
			Type: metric.Gauge,
			Help: "Unix deletion timestamp",
			GenerateFunc: wrapPodFunc(func(p *v1.Pod) *metric.Family {
				ms := []*metric.Metric{}

				if p.DeletionTimestamp != nil && !p.DeletionTimestamp.IsZero() {
					ms = append(ms, &metric.Metric{
						LabelKeys:   []string{},
						LabelValues: []string{},
						Value:       float64(p.DeletionTimestamp.Unix()),
					})
				}

				return &metric.Family{
					Metrics: ms,
				}
			}),
		},
{
Name: "kube_pod_restart_policy",
Type: metric.Gauge,
@@ -354,6 +375,30 @@ var (
}
}),
},
		{
			Name: "kube_pod_status_reason",
			Type: metric.Gauge,
			Help: "The pod status reasons",
			GenerateFunc: wrapPodFunc(func(p *v1.Pod) *metric.Family {
				ms := []*metric.Metric{}

				// Emit one series per known reason: 1 if it matches the pod's
				// status reason, 0 otherwise. Building the literal inline avoids
				// shadowing the metric package with a local variable.
				for _, reason := range podStatusReasons {
					ms = append(ms, &metric.Metric{
						LabelKeys:   []string{"reason"},
						LabelValues: []string{reason},
						Value:       boolFloat64(p.Status.Reason == reason),
					})
				}

				return &metric.Family{
					Metrics: ms,
				}
			}),
		},
{
Name: "kube_pod_container_info",
Type: metric.Gauge,
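The `kube_pod_status_reason` generator above always emits one series per entry in `podStatusReasons`, set to 1 for the pod's current reason and 0 for the others. A self-contained sketch of that behaviour is shown below; the `sample` type and `reasonSamples` function are hypothetical simplifications and do not use the project's `metric` package.

```go
package main

import "fmt"

// podStatusReasons mirrors the list added in internal/store/pod.go.
var podStatusReasons = []string{"NodeLost", "Evicted"}

// sample is a hypothetical, simplified stand-in for metric.Metric.
type sample struct {
	Reason string
	Value  float64
}

// reasonSamples returns one sample per known reason: 1 when it equals the
// pod's current status reason, 0 otherwise.
func reasonSamples(current string) []sample {
	out := make([]sample, 0, len(podStatusReasons))
	for _, r := range podStatusReasons {
		v := 0.0
		if r == current {
			v = 1
		}
		out = append(out, sample{Reason: r, Value: v})
	}
	return out
}

func main() {
	for _, s := range reasonSamples("Evicted") {
		fmt.Printf("kube_pod_status_reason{reason=%q} %g\n", s.Reason, s.Value)
	}
}
```

A reason outside the list (for example `"other reason"`, as in one of the test cases) yields 0 for every series. Because each reason is always exported, a query has to filter on the sample value rather than on the mere presence of a series.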
74 changes: 72 additions & 2 deletions internal/store/pod_test.go
@@ -863,6 +863,32 @@ kube_pod_container_status_last_terminated_reason{container="container7",namespac
`,
MetricNames: []string{"kube_pod_created", "kube_pod_info", "kube_pod_start_time", "kube_pod_completion_time", "kube_pod_owner"},
},
{
Obj: &v1.Pod{
ObjectMeta: metav1.ObjectMeta{
Name: "pod1",
CreationTimestamp: metav1.Time{Time: time.Unix(1500000000, 0)},
Namespace: "ns1",
UID: "abc-123-xxx",
DeletionTimestamp: &metav1.Time{Time: time.Unix(1800000000, 0)},
},
Spec: v1.PodSpec{
NodeName: "node1",
PriorityClassName: "system-node-critical",
},
Status: v1.PodStatus{
HostIP: "1.1.1.1",
PodIP: "1.2.3.4",
StartTime: &metav1StartTime,
},
},
Want: `
# HELP kube_pod_deleted Unix deletion timestamp
# TYPE kube_pod_deleted gauge
kube_pod_deleted{namespace="ns1",pod="pod1"} 1.8e+09
`,
MetricNames: []string{"kube_pod_deleted"},
},
{
Obj: &v1.Pod{
ObjectMeta: metav1.ObjectMeta{
@@ -1055,14 +1081,58 @@ kube_pod_container_status_last_terminated_reason{container="container7",namespac
},
Want: `
# HELP kube_pod_status_phase The pods current phase.
# HELP kube_pod_status_reason The pod status reasons
# TYPE kube_pod_status_phase gauge
# TYPE kube_pod_status_reason gauge
kube_pod_status_phase{namespace="ns4",phase="Failed",pod="pod4"} 0
kube_pod_status_phase{namespace="ns4",phase="Pending",pod="pod4"} 0
kube_pod_status_phase{namespace="ns4",phase="Running",pod="pod4"} 0
kube_pod_status_phase{namespace="ns4",phase="Succeeded",pod="pod4"} 0
kube_pod_status_phase{namespace="ns4",phase="Unknown",pod="pod4"} 1
kube_pod_status_reason{namespace="ns4",pod="pod4",reason="Evicted"} 0
kube_pod_status_reason{namespace="ns4",pod="pod4",reason="NodeLost"} 1
`,
MetricNames: []string{"kube_pod_status_phase"},
MetricNames: []string{"kube_pod_status_phase", "kube_pod_status_reason"},
},
{
Obj: &v1.Pod{
ObjectMeta: metav1.ObjectMeta{
Name: "pod4",
Namespace: "ns4",
DeletionTimestamp: &metav1.Time{},
},
Status: v1.PodStatus{
Phase: v1.PodRunning,
Reason: "Evicted",
},
},
Want: `
# HELP kube_pod_status_reason The pod status reasons
# TYPE kube_pod_status_reason gauge
kube_pod_status_reason{namespace="ns4",pod="pod4",reason="Evicted"} 1
kube_pod_status_reason{namespace="ns4",pod="pod4",reason="NodeLost"} 0
`,
MetricNames: []string{"kube_pod_status_reason"},
},
{
Obj: &v1.Pod{
ObjectMeta: metav1.ObjectMeta{
Name: "pod4",
Namespace: "ns4",
DeletionTimestamp: &metav1.Time{},
},
Status: v1.PodStatus{
Phase: v1.PodRunning,
Reason: "other reason",
},
},
Want: `
# HELP kube_pod_status_reason The pod status reasons
# TYPE kube_pod_status_reason gauge
kube_pod_status_reason{namespace="ns4",pod="pod4",reason="Evicted"} 0
kube_pod_status_reason{namespace="ns4",pod="pod4",reason="NodeLost"} 0
`,
MetricNames: []string{"kube_pod_status_reason"},
},
{
Obj: &v1.Pod{
@@ -1535,7 +1605,7 @@ func BenchmarkPodStore(b *testing.B) {
},
}

expectedFamilies := 35
expectedFamilies := 37
for n := 0; n < b.N; n++ {
families := f(pod)
if len(families) != expectedFamilies {
6 changes: 6 additions & 0 deletions main_test.go
@@ -171,6 +171,8 @@ kube_pod_labels{namespace="default",pod="pod0"} 1
# HELP kube_pod_created Unix creation timestamp
# TYPE kube_pod_created gauge
kube_pod_created{namespace="default",pod="pod0"} 1.5e+09
# HELP kube_pod_deleted Unix deletion timestamp
# TYPE kube_pod_deleted gauge
# HELP kube_pod_restart_policy Describes the restart policy in use by this pod.
# TYPE kube_pod_restart_policy gauge
kube_pod_restart_policy{namespace="default",pod="pod0",type="Always"} 1
@@ -187,6 +189,10 @@ kube_pod_status_phase{namespace="default",pod="pod0",phase="Running"} 1
kube_pod_status_phase{namespace="default",pod="pod0",phase="Unknown"} 0
# HELP kube_pod_status_ready Describes whether the pod is ready to serve requests.
# TYPE kube_pod_status_ready gauge
# HELP kube_pod_status_reason The pod status reasons
# TYPE kube_pod_status_reason gauge
kube_pod_status_reason{namespace="default",pod="pod0",reason="Evicted"} 0
kube_pod_status_reason{namespace="default",pod="pod0",reason="NodeLost"} 0
# HELP kube_pod_status_scheduled Describes the status of the scheduling process for the pod.
# TYPE kube_pod_status_scheduled gauge
# HELP kube_pod_container_info Information about a container in a pod.