Cluster resource consumption metrics #3983

Merged

Conversation

@elmiko (Contributor) commented Mar 30, 2021

This change adds 4 new metrics to the cluster autoscaler that help users determine the CPU core and memory usage of their cluster.

There are 2 metrics devoted to the limits for cores and memory, each carrying a `direction` label for maximum and minimum.

  • cluster_autoscaler_cpu_limits_cores
  • cluster_autoscaler_memory_limits_bytes

There are also 2 metrics reporting the current number of cores and the current amount of memory in bytes.

  • cluster_autoscaler_cluster_cpu_current_cores
  • cluster_autoscaler_cluster_memory_current_bytes

Sample scrape of new metrics on a running cluster:

# HELP cluster_autoscaler_cpu_limits_cores [ALPHA] Minimum and maximum number of cores in the cluster.
# TYPE cluster_autoscaler_cpu_limits_cores gauge
cluster_autoscaler_cpu_limits_cores{direction="maximum"} 320000
cluster_autoscaler_cpu_limits_cores{direction="minimum"} 0

# HELP cluster_autoscaler_memory_limits_bytes [ALPHA] Minimum and maximum number of bytes of memory in cluster.
# TYPE cluster_autoscaler_memory_limits_bytes gauge
cluster_autoscaler_memory_limits_bytes{direction="maximum"} 6.8719476736e+15
cluster_autoscaler_memory_limits_bytes{direction="minimum"} 0

# HELP cluster_autoscaler_cluster_cpu_current_cores [ALPHA] Current number of cores in the cluster, minus deleting nodes.
# TYPE cluster_autoscaler_cluster_cpu_current_cores gauge
cluster_autoscaler_cluster_cpu_current_cores 24

# HELP cluster_autoscaler_cluster_memory_current_bytes [ALPHA] Current number of bytes of memory in the cluster, minus deleting nodes.
# TYPE cluster_autoscaler_cluster_memory_current_bytes gauge
cluster_autoscaler_cluster_memory_current_bytes 1.00827406336e+11
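For readers who want to see how gauges of this shape are typically wired up, here is a minimal, hypothetical sketch using the plain Prometheus Go client. It is not the PR's actual code: the cluster autoscaler registers its metrics through Kubernetes' component-base metrics wrappers (which is where the [ALPHA] stability tag above comes from), and the function name below is invented for illustration.

```go
// Illustrative sketch only (not the PR's actual code): gauges shaped like the
// ones in the sample scrape, defined with the Prometheus Go client.
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	// Limits carry a "direction" label with values "minimum" and "maximum".
	cpuLimitsCores = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Namespace: "cluster_autoscaler",
			Name:      "cpu_limits_cores",
			Help:      "Minimum and maximum number of cores in the cluster.",
		},
		[]string{"direction"},
	)
	// The current count is a plain gauge with no labels.
	clusterCPUCurrentCores = prometheus.NewGauge(
		prometheus.GaugeOpts{
			Namespace: "cluster_autoscaler",
			Name:      "cluster_cpu_current_cores",
			Help:      "Current number of cores in the cluster, minus deleting nodes.",
		},
	)
)

func init() {
	prometheus.MustRegister(cpuLimitsCores, clusterCPUCurrentCores)
}

// updateCPUMetrics (hypothetical helper) records the configured limits and
// the observed core count.
func updateCPUMetrics(minCores, maxCores, currentCores int64) {
	cpuLimitsCores.WithLabelValues("minimum").Set(float64(minCores))
	cpuLimitsCores.WithLabelValues("maximum").Set(float64(maxCores))
	clusterCPUCurrentCores.Set(float64(currentCores))
}
```

The memory metrics follow the same pattern, with the gauge values expressed in bytes.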

User Story
As a cluster autoscaler user, I would like to monitor my cluster through metrics to determine when the cluster is nearing its limits for cores and memory usage. By having metrics for the current counts as well as maximum and minimum limits, I will be able to effectively monitor my cluster.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 30, 2021
@elmiko (Contributor, Author) commented Mar 30, 2021

I took the approach of calculating the current counts once per interval in the RunOnce function, regardless of whether scaling happened or not. I think this will give us the most accurate count.

I have left the memory count in bytes, as that is what we use internally, but I am not 100% sure this is the best practice for metrics, and I noticed the numbers get quite large. Happy for any suggestions about making this better; perhaps use gibibytes or mebibytes here?
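As a rough sketch of that once-per-interval idea (the function and variable names below are assumptions for illustration, not the PR's actual code), the update could look something like this, with gauges set rather than incremented so each iteration reports the current totals:

```go
// Hypothetical sketch of a once-per-iteration update, called from RunOnce.
// Registration with a registry is omitted for brevity.
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	apiv1 "k8s.io/api/core/v1"
)

var (
	clusterCPUCurrentCores = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "cluster_autoscaler_cluster_cpu_current_cores",
		Help: "Current number of cores in the cluster, minus deleting nodes.",
	})
	clusterMemoryCurrentBytes = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "cluster_autoscaler_cluster_memory_current_bytes",
		Help: "Current number of bytes of memory in the cluster, minus deleting nodes.",
	})
)

// updateClusterResourceMetrics sums node capacity (callers would first filter
// out nodes that are being deleted) and sets the gauges. Setting rather than
// incrementing keeps the values accurate whether or not scaling happened.
func updateClusterResourceMetrics(nodes []*apiv1.Node) {
	var coresTotal, memoryBytesTotal int64
	for _, node := range nodes {
		cpu := node.Status.Capacity[apiv1.ResourceCPU]
		mem := node.Status.Capacity[apiv1.ResourceMemory]
		coresTotal += cpu.Value()
		memoryBytesTotal += mem.Value()
	}
	clusterCPUCurrentCores.Set(float64(coresTotal))
	clusterMemoryCurrentBytes.Set(float64(memoryBytesTotal))
}
```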

cc @gjtempleton ptal

@@ -27,6 +27,11 @@ All the metrics are prefixed with `cluster_autoscaler_`.
| nodes_count | Gauge | `state`=<node-state> | Number of nodes in cluster. |
| unschedulable_pods_count | Gauge | | Number of unschedulable ("Pending") pods in the cluster. |
| node_groups_count | Gauge | `node_group_type`=<node-group-type> | Number of node groups managed by CA. |
| max_nodes_count | Gauge | | Maximum number of nodes in all node groups. |
@elmiko (Contributor, Author) commented on the diff above:
This metric was missing from the doc, but since it's somewhat related to resource counting, I decided to add the description.

@gjtempleton (Member) commented:

I'm slightly uncomfortable with the metric naming as is. I realise using *_count in the metric names is consistent with a number of the metrics we already have, but it could lead to users assuming the metrics are counters rather than gauges.

The metrics as written also don't make clear the unit being represented, which the Prometheus docs suggest is best practice and which would be consistent with metrics like kube-state-metrics' pod metrics. I wonder if both issues could be solved by names something like:

cluster_autoscaler_cpu_limits_cores, cluster_autoscaler_memory_limits_bytes, cluster_autoscaler_cluster_cpu_current_cores, and cluster_autoscaler_cluster_memory_current_bytes

@gjtempleton (Member) commented:

In terms of how often the metrics are evaluated and using bytes as the unit, I'm all in favour of both as they currently are.

Using mebibytes or larger would make us inconsistent with a number of existing metrics across the Kubernetes ecosystem (e.g. kube-state-metrics), and most tooling for querying these metrics is smart enough to handle the large numbers and translate them into the most relevant unit for users.

@elmiko (Contributor, Author) commented Apr 5, 2021

Thanks for the review @gjtempleton, happy to change the names. I tend to agree with your reasoning.

Commit message for the force-pushed commit:

This change adds 4 metrics that can be used to monitor the minimum and
maximum limits for CPU and memory, as well as the current counts in
cores and bytes, respectively.

The four metrics added are:
* `cluster_autoscaler_cpu_limits_cores`
* `cluster_autoscaler_cluster_cpu_current_cores`
* `cluster_autoscaler_memory_limits_bytes`
* `cluster_autoscaler_cluster_memory_current_bytes`

This change also adds the `max_cores_total` metric to the metrics
proposal doc, as it was previously not recorded there.

User story: As a cluster autoscaler user, I would like to monitor my
cluster through metrics to determine when the cluster is nearing its
limits for cores and memory usage.
@elmiko elmiko force-pushed the cluster-resource-consumption-metrics branch from cc12a63 to a24ea6c on April 6, 2021 at 14:39
@elmiko
Copy link
Contributor Author

elmiko commented Apr 6, 2021

Updated:

  • changed metric names
  • squashed into a single commit
  • updated function names to be more reflective of new metric names

@gjtempleton (Member) commented:

Sorry for the delay in reviewing this again. Thanks for taking the feedback on board.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 28, 2021
@mwielgus (Contributor) left a comment:

/lgtm
/approve

@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: elmiko, mwielgus

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 13, 2021
@k8s-ci-robot k8s-ci-robot merged commit 2beea02 into kubernetes:master May 13, 2021
@elmiko elmiko deleted the cluster-resource-consumption-metrics branch May 14, 2021 16:52