Cluster resource consumption metrics #3983

Merged

Conversation

@elmiko (Contributor) commented Mar 30, 2021

This change adds 4 new metrics to the cluster autoscaler that help users determine the CPU core and memory usage of their cluster.

There are 2 metrics devoted to the limits for cores and memory, each carrying a `direction` label for maximum and minimum.

  • cluster_autoscaler_cpu_limits_cores
  • cluster_autoscaler_memory_limits_bytes

There are also 2 metrics reporting the current number of cores and the current amount of memory in bytes.

  • cluster_autoscaler_cluster_cpu_current_cores
  • cluster_autoscaler_cluster_memory_current_bytes

Sample scrape of new metrics on a running cluster:

# HELP cluster_autoscaler_cpu_limits_cores [ALPHA] Minimum and maximum number of cores in the cluster.
# TYPE cluster_autoscaler_cpu_limits_cores gauge
cluster_autoscaler_cpu_limits_cores{direction="maximum"} 320000
cluster_autoscaler_cpu_limits_cores{direction="minimum"} 0

# HELP cluster_autoscaler_memory_limits_bytes [ALPHA] Minimum and maximum number of bytes of memory in cluster.
# TYPE cluster_autoscaler_memory_limits_bytes gauge
cluster_autoscaler_memory_limits_bytes{direction="maximum"} 6.8719476736e+15
cluster_autoscaler_memory_limits_bytes{direction="minimum"} 0

# HELP cluster_autoscaler_cluster_cpu_current_cores [ALPHA] Current number of cores in the cluster, minus deleting nodes.
# TYPE cluster_autoscaler_cluster_cpu_current_cores gauge
cluster_autoscaler_cluster_cpu_current_cores 24

# HELP cluster_autoscaler_cluster_memory_current_bytes [ALPHA] Current number of bytes of memory in the cluster, minus deleting nodes.
# TYPE cluster_autoscaler_cluster_memory_current_bytes gauge
cluster_autoscaler_cluster_memory_current_bytes 1.00827406336e+11
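For readers who want to see how gauges of this shape are typically wired up, here is a minimal, hypothetical sketch using the plain Prometheus Go client. It is not the PR's actual code: the cluster autoscaler registers its metrics through Kubernetes' component-base metrics wrappers (which is where the [ALPHA] stability tag above comes from), and the function name below is invented for illustration.

```go
// Illustrative sketch only (not the PR's actual code): gauges shaped like the
// ones in the sample scrape, defined with the Prometheus Go client.
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	// Limits carry a "direction" label with values "minimum" and "maximum".
	cpuLimitsCores = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Namespace: "cluster_autoscaler",
			Name:      "cpu_limits_cores",
			Help:      "Minimum and maximum number of cores in the cluster.",
		},
		[]string{"direction"},
	)
	// The current count is a plain gauge with no labels.
	clusterCPUCurrentCores = prometheus.NewGauge(
		prometheus.GaugeOpts{
			Namespace: "cluster_autoscaler",
			Name:      "cluster_cpu_current_cores",
			Help:      "Current number of cores in the cluster, minus deleting nodes.",
		},
	)
)

func init() {
	prometheus.MustRegister(cpuLimitsCores, clusterCPUCurrentCores)
}

// updateCPUMetrics (hypothetical helper) records the configured limits and
// the observed core count.
func updateCPUMetrics(minCores, maxCores, currentCores int64) {
	cpuLimitsCores.WithLabelValues("minimum").Set(float64(minCores))
	cpuLimitsCores.WithLabelValues("maximum").Set(float64(maxCores))
	clusterCPUCurrentCores.Set(float64(currentCores))
}
```

The memory metrics follow the same pattern, with the gauge values expressed in bytes.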

User Story
As a cluster autoscaler user, I would like to monitor my cluster through metrics to determine when the cluster is nearing its limits for cores and memory usage. By having metrics for the current counts as well as maximum and minimum limits, I will be able to effectively monitor my cluster.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 30, 2021
@elmiko (Contributor, Author) commented Mar 30, 2021

I took the approach of calculating the current counts once per interval in the RunOnce function, regardless of whether scaling happened or not. I think this will give us the most accurate count.

I have left the memory count in bytes, as that is what we use internally, but I am not 100% sure this is the best practice for metrics, and I noticed the numbers get quite large. Happy for any suggestions about making this better; perhaps use gibibytes or mebibytes here?
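As a rough sketch of that once-per-interval idea (the function and variable names below are assumptions for illustration, not the PR's actual code), the update could look something like this, with gauges set rather than incremented so each iteration reports the current totals:

```go
// Hypothetical sketch of a once-per-iteration update, called from RunOnce.
// Registration with a registry is omitted for brevity.
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	apiv1 "k8s.io/api/core/v1"
)

var (
	clusterCPUCurrentCores = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "cluster_autoscaler_cluster_cpu_current_cores",
		Help: "Current number of cores in the cluster, minus deleting nodes.",
	})
	clusterMemoryCurrentBytes = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "cluster_autoscaler_cluster_memory_current_bytes",
		Help: "Current number of bytes of memory in the cluster, minus deleting nodes.",
	})
)

// updateClusterResourceMetrics sums node capacity (callers would first filter
// out nodes that are being deleted) and sets the gauges. Setting rather than
// incrementing keeps the values accurate whether or not scaling happened.
func updateClusterResourceMetrics(nodes []*apiv1.Node) {
	var coresTotal, memoryBytesTotal int64
	for _, node := range nodes {
		cpu := node.Status.Capacity[apiv1.ResourceCPU]
		mem := node.Status.Capacity[apiv1.ResourceMemory]
		coresTotal += cpu.Value()
		memoryBytesTotal += mem.Value()
	}
	clusterCPUCurrentCores.Set(float64(coresTotal))
	clusterMemoryCurrentBytes.Set(float64(memoryBytesTotal))
}
```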

cc @gjtempleton ptal

@@ -27,6 +27,11 @@ All the metrics are prefixed with `cluster_autoscaler_`.
| nodes_count | Gauge | `state`=<node-state> | Number of nodes in cluster. |
| unschedulable_pods_count | Gauge | | Number of unschedulable ("Pending") pods in the cluster. |
| node_groups_count | Gauge | `node_group_type`=<node-group-type> | Number of node groups managed by CA. |
| max_nodes_count | Gauge | | Maximum number of nodes in all node groups. |
@elmiko (Contributor, Author) commented on the diff above:
This metric was missing from the doc, but since it's somewhat related to resource counting, I decided to add the description.

@gjtempleton (Member) commented:

I'm slightly uncomfortable with the metric naming as is. I realise using *_count in the metric names is consistent with a number of the metrics we already have, but it could lead to users assuming the metrics are counters rather than gauges.

The metrics as written also don't make clear the unit being represented, which the Prometheus docs suggest is best practice and which would be consistent with metrics like kube-state-metrics' pod metrics. I wonder if both issues could be solved by names something like:

cluster_autoscaler_cpu_limits_cores, cluster_autoscaler_memory_limits_bytes, cluster_autoscaler_cluster_cpu_current_cores, and cluster_autoscaler_cluster_memory_current_bytes

@gjtempleton (Member) commented:

In terms of how often the metrics are evaluated and using bytes as the unit, I'm all in favour of both as they currently are.

Using mebibytes or larger would make us inconsistent with a number of existing metrics across the Kubernetes ecosystem (e.g. kube-state-metrics), and most tooling for querying these metrics is smart enough to handle the large numbers and translate them into the most relevant unit for users.

@elmiko (Contributor, Author) commented Apr 5, 2021

Thanks for the review @gjtempleton, happy to change the names. I tend to agree with your reasoning.

Commit message for the force-pushed commit:

This change adds 4 metrics that can be used to monitor the minimum and
maximum limits for CPU and memory, as well as the current counts in
cores and bytes, respectively.

The four metrics added are:
* `cluster_autoscaler_cpu_limits_cores`
* `cluster_autoscaler_cluster_cpu_current_cores`
* `cluster_autoscaler_memory_limits_bytes`
* `cluster_autoscaler_cluster_memory_current_bytes`

This change also adds the `max_cores_total` metric to the metrics
proposal doc, as it was previously not recorded there.

User story: As a cluster autoscaler user, I would like to monitor my
cluster through metrics to determine when the cluster is nearing its
limits for cores and memory usage.
@elmiko elmiko force-pushed the cluster-resource-consumption-metrics branch from cc12a63 to a24ea6c on April 6, 2021 at 14:39
@elmiko
Copy link
Contributor Author

elmiko commented Apr 6, 2021

Updated:

  • changed metric names
  • squashed into a single commit
  • updated function names to be more reflective of new metric names

@gjtempleton (Member) commented:

Sorry for the delay in reviewing this again. Thanks for taking the feedback on board.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 28, 2021
@mwielgus (Contributor) left a comment:

/lgtm
/approve

@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: elmiko, mwielgus

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 13, 2021
@k8s-ci-robot k8s-ci-robot merged commit 2beea02 into kubernetes:master May 13, 2021
@elmiko elmiko deleted the cluster-resource-consumption-metrics branch May 14, 2021 16:52