Merge pull request #1013 from clamoriniere/feature/addTerminatingPhaseInPod_Status_Phase_metric

Add "Terminating" status in kube_pod_status_phase metrics
k8s-ci-robot authored Feb 5, 2020
2 parents aa8a0af + 311c682 commit 2148cb9
Showing 5 changed files with 156 additions and 4 deletions.
2 changes: 1 addition & 1 deletion Makefile
@@ -49,7 +49,7 @@ doccheck: generate
	@echo "- Checking if the generated documentation is up to date..."
	@git diff --exit-code
	@echo "- Checking if the documentation is in sync with the code..."
-	@grep -hoE '(kube_[^ |]+)' docs/* --exclude=README.md| sort -u > documented_metrics
+	@grep -hoE '(\| kube_[^ |]+)' docs/* --exclude=README.md| sed -E 's/\| //g' | sort -u > documented_metrics
	@find internal/store -type f -not -name '*_test.go' -exec sed -nE 's/.*"(kube_[^"]+)"/\1/p' {} \; | sed -E 's/,//g' | sort -u > code_metrics
	@diff -u0 code_metrics documented_metrics || (echo "ERROR: Metrics with - are present in code but missing in documentation, metrics with + are documented but not found in code."; exit 1)
	@echo OK
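A plausible reason for the stricter `grep` pattern (an inference from the diff, not something the commit states) is that the documentation now contains PromQL queries, and the old pattern would pick up `kube_...` tokens from those queries as if they were documented metric names. The Go sketch below reproduces both patterns with the standard `regexp` package; the helper name `documentedMetrics` and the sample lines are illustrative assumptions.

```go
package main

import (
	"fmt"
	"regexp"
)

// Patterns mirroring the two grep expressions in the doccheck target.
var (
	oldPattern = regexp.MustCompile(`kube_[^ |]+`)
	newPattern = regexp.MustCompile(`\| (kube_[^ |]+)`)
)

// documentedMetrics (hypothetical helper) returns the metric names matched
// by the table-row pattern, with the leading "| " stripped as sed does.
func documentedMetrics(line string) []string {
	var names []string
	for _, m := range newPattern.FindAllStringSubmatch(line, -1) {
		names = append(names, m[1])
	}
	return names
}

func main() {
	tableRow := "| kube_pod_deleted | Gauge | `pod`=<pod-name> | EXPERIMENTAL |"
	query := "count(kube_pod_deleted) by (namespace, pod)"

	// The old pattern matches in both lines, and the query match drags
	// along the trailing ")" because it only stops at a space or a pipe.
	fmt.Println(oldPattern.FindAllString(tableRow, -1))
	fmt.Println(oldPattern.FindAllString(query, -1))

	// The new pattern only matches names that open a table cell.
	fmt.Println(documentedMetrics(tableRow))
	fmt.Println(documentedMetrics(query))
}
```

With the new pattern, only names that start a table cell count as documented, so query text in the docs no longer affects the comparison against `code_metrics`.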
33 changes: 32 additions & 1 deletion docs/pod-metrics.md
@@ -21,7 +21,8 @@
| kube_pod_container_status_restarts_total | Counter | `container`=&lt;container-name&gt; <br> `namespace`=&lt;pod-namespace&gt; <br> `pod`=&lt;pod-name&gt; | STABLE |
| kube_pod_container_resource_requests | Gauge | `resource`=&lt;resource-name&gt; <br> `unit`=&lt;resource-unit&gt; <br> `container`=&lt;container-name&gt; <br> `pod`=&lt;pod-name&gt; <br> `namespace`=&lt;pod-namespace&gt; <br> `node`=&lt; node-name&gt; | STABLE |
| kube_pod_container_resource_limits | Gauge | `resource`=&lt;resource-name&gt; <br> `unit`=&lt;resource-unit&gt; <br> `container`=&lt;container-name&gt; <br> `pod`=&lt;pod-name&gt; <br> `namespace`=&lt;pod-namespace&gt; <br> `node`=&lt; node-name&gt; | STABLE |
-| kube_pod_created | Gauge | `pod`=&lt;pod-name&gt; <br> `namespace`=&lt;pod-namespace&gt; |
+| kube_pod_created | Gauge | `pod`=&lt;pod-name&gt; <br> `namespace`=&lt;pod-namespace&gt; | STABLE |
+| kube_pod_deleted | Gauge | `pod`=&lt;pod-name&gt; <br> `namespace`=&lt;pod-namespace&gt; | EXPERIMENTAL |
| kube_pod_restart_policy | Gauge | `pod`=&lt;pod-name&gt; <br> `namespace`=&lt;pod-namespace&gt; <br> `type`=&lt;Always|Never|OnFailure&gt; | STABLE |
| kube_pod_init_container_info | Gauge | `container`=&lt;container-name&gt; <br> `pod`=&lt;pod-name&gt; <br> `namespace`=&lt;pod-namespace&gt; <br> `image`=&lt;image-name&gt; <br> `image_id`=&lt;image-id&gt; <br> `container_id`=&lt;containerid&gt; | STABLE |
| kube_pod_init_container_status_waiting | Gauge | `container`=&lt;container-name&gt; <br> `pod`=&lt;pod-name&gt; <br> `namespace`=&lt;pod-namespace&gt; | STABLE |
@@ -35,5 +36,35 @@
| kube_pod_init_container_resource_limits | Gauge | `resource`=&lt;resource-name&gt; <br> `unit`=&lt;resource-unit&gt; <br> `container`=&lt;container-name&gt; <br> `pod`=&lt;pod-name&gt; <br> `namespace`=&lt;pod-namespace&gt; <br> `node`=&lt; node-name&gt; | STABLE |
| kube_pod_spec_volumes_persistentvolumeclaims_info | Gauge | `pod`=&lt;pod-name&gt; <br> `namespace`=&lt;pod-namespace&gt; <br> `volume`=&lt;volume-name&gt; <br> `persistentvolumeclaim`=&lt;persistentvolumeclaim-claimname&gt; | STABLE |
| kube_pod_spec_volumes_persistentvolumeclaims_readonly | Gauge | `pod`=&lt;pod-name&gt; <br> `namespace`=&lt;pod-namespace&gt; <br> `volume`=&lt;volume-name&gt; <br> `persistentvolumeclaim`=&lt;persistentvolumeclaim-claimname&gt; | STABLE |
| kube_pod_status_reason | Gauge | `pod`=&lt;pod-name&gt; <br> `namespace`=&lt;pod-namespace&gt; <br> `reason`=&lt;NodeLost\|Evicted&gt; | EXPERIMENTAL |
| kube_pod_status_scheduled_time | Gauge | `pod`=&lt;pod-name&gt; <br> `namespace`=&lt;pod-namespace&gt; | STABLE |
| kube_pod_status_unschedulable | Gauge | `pod`=&lt;pod-name&gt; <br> `namespace`=&lt;pod-namespace&gt; | STABLE |

## Useful metrics queries

### How to retrieve non-standard Pod state

Some Pod states, such as "Terminating" and "Unknown", are not straightforward to retrieve because they are not stored in a field of `Pod.Status`.

To get them, you need to combine several metrics, much as the `kubectl` command-line code does.

For example:

* To list the Pods that are in the `Unknown` state, run the following PromQL query: `count(kube_pod_status_phase{phase="Running"}) by (namespace, pod) * count(kube_pod_status_reason{reason="NodeLost"}) by (namespace, pod)`

* For Pods in the `Terminating` state: `count(kube_pod_status_phase{phase="Running"}) by (namespace, pod) * count(kube_pod_deleted) by (namespace, pod) * count(kube_pod_status_reason{reason!="NodeLost"}) by (namespace, pod)`

Here is an example of a Prometheus rule that alerts on a Pod that has been in the `Terminating` state for more than `5m`.

```yaml
groups:
- name: Pod state
  rules:
  - alert: PodsBlockInTerminatingState
    expr: count(kube_pod_status_phase{phase="Running"}) by (namespace, pod) * count(kube_pod_deleted) by (namespace, pod) * count(kube_pod_status_reason{reason!="NodeLost"}) by (namespace, pod) > 0
    for: 5m
    labels:
      severity: page
    annotations:
      summary: Pod {{ $labels.namespace }}/{{ $labels.pod }} is stuck in the Terminating state.
```
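To make the intent of the two queries concrete, here is a minimal Go sketch of the same decision applied to a single Pod. It follows the conditions in the queries above rather than the actual `kubectl` source, and the `podView` type and `displayState` function are hypothetical names introduced only for illustration.

```go
package main

import "fmt"

// podView is a hypothetical, trimmed-down view of a Pod; it is not a
// Kubernetes API type.
type podView struct {
	Phase    string // value of status.phase, e.g. "Running"
	Reason   string // value of status.reason, e.g. "NodeLost"
	Deleting bool   // true when metadata.deletionTimestamp is set
}

// displayState applies the same conditions as the two documented queries:
// Running + NodeLost is shown as Unknown, Running + deletion timestamp
// (and any other reason) is shown as Terminating.
func displayState(p podView) string {
	switch {
	case p.Phase == "Running" && p.Reason == "NodeLost":
		return "Unknown"
	case p.Phase == "Running" && p.Deleting:
		return "Terminating"
	default:
		return p.Phase
	}
}

func main() {
	pods := []podView{
		{Phase: "Running"},
		{Phase: "Running", Deleting: true},
		{Phase: "Running", Reason: "NodeLost", Deleting: true},
	}
	for _, p := range pods {
		fmt.Println(displayState(p))
	}
}
```

The `NodeLost` check comes first so that a Pod on a lost node is reported as `Unknown` even when it also carries a deletion timestamp, which matches the `reason!="NodeLost"` filter in the Terminating query.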
45 changes: 45 additions & 0 deletions internal/store/pod.go
@@ -37,6 +37,7 @@ var (
	descPodLabelsDefaultLabels = []string{"namespace", "pod"}
	containerWaitingReasons = []string{"ContainerCreating", "CrashLoopBackOff", "CreateContainerConfigError", "ErrImagePull", "ImagePullBackOff", "CreateContainerError", "InvalidImageName"}
	containerTerminatedReasons = []string{"OOMKilled", "Completed", "Error", "ContainerCannotRun", "DeadlineExceeded", "Evicted"}
	podStatusReasons = []string{"NodeLost", "Evicted"}

	podMetricFamilies = []generator.FamilyGenerator{
		{
@@ -197,6 +198,26 @@ var (
				}
			}),
		},
		{
			Name: "kube_pod_deleted",
			Type: metric.Gauge,
			Help: "Unix deletion timestamp",
			GenerateFunc: wrapPodFunc(func(p *v1.Pod) *metric.Family {
				ms := []*metric.Metric{}

				if p.DeletionTimestamp != nil && !p.DeletionTimestamp.IsZero() {
					ms = append(ms, &metric.Metric{
						LabelKeys: []string{},
						LabelValues: []string{},
						Value: float64(p.DeletionTimestamp.Unix()),
					})
				}

				return &metric.Family{
					Metrics: ms,
				}
			}),
		},
		{
			Name: "kube_pod_restart_policy",
			Type: metric.Gauge,
@@ -354,6 +375,30 @@ var (
				}
			}),
		},
		{
			Name: "kube_pod_status_reason",
			Type: metric.Gauge,
			Help: "The pod status reasons",
			GenerateFunc: wrapPodFunc(func(p *v1.Pod) *metric.Family {
				ms := []*metric.Metric{}

				for _, reason := range podStatusReasons {
					metric := &metric.Metric{}
					metric.LabelKeys = []string{"reason"}
					metric.LabelValues = []string{reason}
					if p.Status.Reason == reason {
						metric.Value = boolFloat64(true)
					} else {
						metric.Value = boolFloat64(false)
					}
					ms = append(ms, metric)
				}

				return &metric.Family{
					Metrics: ms,
				}
			}),
		},
		{
			Name: "kube_pod_container_info",
			Type: metric.Gauge,
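The `kube_pod_status_reason` generator above emits one series per known reason and sets exactly the matching one to 1. The standalone Go sketch below shows that encoding without the kube-state-metrics packages; the `sample` type and `reasonSamples` function are hypothetical stand-ins, and only the `podStatusReasons` list is taken from the diff.

```go
package main

import "fmt"

// sample is a hypothetical stand-in for one exported series.
type sample struct {
	Reason string
	Value  float64
}

// Same reason list as in the diff.
var podStatusReasons = []string{"NodeLost", "Evicted"}

// reasonSamples returns one sample per known reason; the sample whose
// reason equals current gets the value 1, every other sample gets 0.
func reasonSamples(current string) []sample {
	out := make([]sample, 0, len(podStatusReasons))
	for _, r := range podStatusReasons {
		v := 0.0
		if r == current {
			v = 1
		}
		out = append(out, sample{Reason: r, Value: v})
	}
	return out
}

func main() {
	for _, s := range reasonSamples("Evicted") {
		fmt.Printf("kube_pod_status_reason{reason=%q} %g\n", s.Reason, s.Value)
	}
}
```

Because a series exists for every known reason regardless of the Pod's actual status, a consumer that wants only Pods with a given reason has to look at the sample value, not merely at whether the series is present.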
74 changes: 72 additions & 2 deletions internal/store/pod_test.go
@@ -863,6 +863,32 @@ kube_pod_container_status_last_terminated_reason{container="container7",namespac
`,
			MetricNames: []string{"kube_pod_created", "kube_pod_info", "kube_pod_start_time", "kube_pod_completion_time", "kube_pod_owner"},
		},
		{
			Obj: &v1.Pod{
				ObjectMeta: metav1.ObjectMeta{
					Name: "pod1",
					CreationTimestamp: metav1.Time{Time: time.Unix(1500000000, 0)},
					Namespace: "ns1",
					UID: "abc-123-xxx",
					DeletionTimestamp: &metav1.Time{Time: time.Unix(1800000000, 0)},
				},
				Spec: v1.PodSpec{
					NodeName: "node1",
					PriorityClassName: "system-node-critical",
				},
				Status: v1.PodStatus{
					HostIP: "1.1.1.1",
					PodIP: "1.2.3.4",
					StartTime: &metav1StartTime,
				},
			},
			Want: `
# HELP kube_pod_deleted Unix deletion timestamp
# TYPE kube_pod_deleted gauge
kube_pod_deleted{namespace="ns1",pod="pod1"} 1.8e+09
`,
			MetricNames: []string{"kube_pod_deleted"},
		},
		{
			Obj: &v1.Pod{
				ObjectMeta: metav1.ObjectMeta{
@@ -1055,14 +1081,58 @@ kube_pod_container_status_last_terminated_reason{container="container7",namespac
			},
			Want: `
# HELP kube_pod_status_phase The pods current phase.
# HELP kube_pod_status_reason The pod status reasons
# TYPE kube_pod_status_phase gauge
# TYPE kube_pod_status_reason gauge
kube_pod_status_phase{namespace="ns4",phase="Failed",pod="pod4"} 0
kube_pod_status_phase{namespace="ns4",phase="Pending",pod="pod4"} 0
kube_pod_status_phase{namespace="ns4",phase="Running",pod="pod4"} 0
kube_pod_status_phase{namespace="ns4",phase="Succeeded",pod="pod4"} 0
kube_pod_status_phase{namespace="ns4",phase="Unknown",pod="pod4"} 1
kube_pod_status_reason{namespace="ns4",pod="pod4",reason="Evicted"} 0
kube_pod_status_reason{namespace="ns4",pod="pod4",reason="NodeLost"} 1
`,
-			MetricNames: []string{"kube_pod_status_phase"},
+			MetricNames: []string{"kube_pod_status_phase", "kube_pod_status_reason"},
		},
		{
			Obj: &v1.Pod{
				ObjectMeta: metav1.ObjectMeta{
					Name: "pod4",
					Namespace: "ns4",
					DeletionTimestamp: &metav1.Time{},
				},
				Status: v1.PodStatus{
					Phase: v1.PodRunning,
					Reason: "Evicted",
				},
			},
			Want: `
# HELP kube_pod_status_reason The pod status reasons
# TYPE kube_pod_status_reason gauge
kube_pod_status_reason{namespace="ns4",pod="pod4",reason="Evicted"} 1
kube_pod_status_reason{namespace="ns4",pod="pod4",reason="NodeLost"} 0
`,
			MetricNames: []string{"kube_pod_status_reason"},
		},
		{
			Obj: &v1.Pod{
				ObjectMeta: metav1.ObjectMeta{
					Name: "pod4",
					Namespace: "ns4",
					DeletionTimestamp: &metav1.Time{},
				},
				Status: v1.PodStatus{
					Phase: v1.PodRunning,
					Reason: "other reason",
				},
			},
			Want: `
# HELP kube_pod_status_reason The pod status reasons
# TYPE kube_pod_status_reason gauge
kube_pod_status_reason{namespace="ns4",pod="pod4",reason="Evicted"} 0
kube_pod_status_reason{namespace="ns4",pod="pod4",reason="NodeLost"} 0
`,
			MetricNames: []string{"kube_pod_status_reason"},
		},
		{
			Obj: &v1.Pod{
@@ -1535,7 +1605,7 @@ func BenchmarkPodStore(b *testing.B) {
		},
	}

-	expectedFamilies := 35
+	expectedFamilies := 37
	for n := 0; n < b.N; n++ {
		families := f(pod)
		if len(families) != expectedFamilies {
6 changes: 6 additions & 0 deletions main_test.go
@@ -171,6 +171,8 @@ kube_pod_labels{namespace="default",pod="pod0"} 1
# HELP kube_pod_created Unix creation timestamp
# TYPE kube_pod_created gauge
kube_pod_created{namespace="default",pod="pod0"} 1.5e+09
# HELP kube_pod_deleted Unix deletion timestamp
# TYPE kube_pod_deleted gauge
# HELP kube_pod_restart_policy Describes the restart policy in use by this pod.
# TYPE kube_pod_restart_policy gauge
kube_pod_restart_policy{namespace="default",pod="pod0",type="Always"} 1
@@ -187,6 +189,10 @@ kube_pod_status_phase{namespace="default",pod="pod0",phase="Running"} 1
kube_pod_status_phase{namespace="default",pod="pod0",phase="Unknown"} 0
# HELP kube_pod_status_ready Describes whether the pod is ready to serve requests.
# TYPE kube_pod_status_ready gauge
# HELP kube_pod_status_reason The pod status reasons
# TYPE kube_pod_status_reason gauge
kube_pod_status_reason{namespace="default",pod="pod0",reason="Evicted"} 0
kube_pod_status_reason{namespace="default",pod="pod0",reason="NodeLost"} 0
# HELP kube_pod_status_scheduled Describes the status of the scheduling process for the pod.
# TYPE kube_pod_status_scheduled gauge
# HELP kube_pod_container_info Information about a container in a pod.