Add metrics for admission attempts count and duration #233

Merged (2 commits) on Apr 29, 2022

Contributor

@alculquicondor alculquicondor commented Apr 27, 2022

What type of PR is this?

/kind feature

What this PR does / why we need it:

Add metrics for admission attempts count and duration, needed to track that the scheduler is doing work.

Add role and rolebinding for prometheus to be able to list the services in the kueue-system namespace.

Which issue(s) this PR fixes:

Part of #199 (4)

Special notes for your reviewer:

This doesn't add the Prometheus ServiceMonitor, Role or RoleBinding by default, as users might have their own monitoring system, or they might not want to set up Prometheus at all.

In a follow-up, I'll document how users can set up Prometheus.

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Apr 27, 2022
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 27, 2022
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 27, 2022
@alculquicondor alculquicondor force-pushed the admission_metrics branch 5 times, most recently from 98ba201 to dbc015e Compare April 27, 2022 19:11
@alculquicondor
Contributor Author

/assign @kerthcet @ahg-g

@k8s-ci-robot
Contributor

@alculquicondor: GitHub didn't allow me to assign the following users: kerthcet.

Note that only kubernetes-sigs members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign @kerthcet @ahg-g

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@alculquicondor
Contributor Author

/hold
to prevent premature merge

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 27, 2022
Add role and rolebinding for prometheus to be able to list the services in the kueue-system namespace.

Change-Id: I77cf51536ebf53ece9a4ba2d8457dbc3e71d1e8d
@@ -20,6 +20,3 @@ spec:
- containerPort: 8443
protocol: TCP
name: https
- name: manager
Contributor

why did you decide to remove this? :)

Contributor Author

It initially had flags related to metrics, but they are all in the component config now. There is no point in keeping it in the patch.

prometheus.CounterOpts{
Subsystem: subsystemName,
Name: "admission_attempts_total",
Help: "Number of attempts to admit pods, by result. `success` means that at least one workload was admitted, `inadmissible` means that no workload was admitted.",
Contributor

s/pods/workloads

"at least one"

Hmm, I don't think this will be easy to understand; we need to clarify that each attempt could admit more than one workload.

Contributor Author

Clarified.

Unless we should increase the counter by the number of workloads?

Although that would not be very useful if you want to measure the throughput of the scheduler.

Contributor

@ahg-g ahg-g Apr 28, 2022

Sounds good. We should have a separate gauge metric tracking the number of workloads, broken down by pending vs. admitted and by ClusterQueue.
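
To make the counting semantics discussed in this thread concrete, here is a minimal Go sketch using only the standard library. It is not the Prometheus client and not the code in this PR: `AttemptCounter`, `Observe`, and `Count` are hypothetical names used purely for illustration. It assumes one increment per admission attempt, labeled `success` when at least one workload was admitted and `inadmissible` otherwise, no matter how many workloads that attempt admitted.

```go
package main

import "fmt"

// AttemptResult is a hypothetical label value for one admission attempt.
type AttemptResult string

const (
	ResultSuccess      AttemptResult = "success"
	ResultInadmissible AttemptResult = "inadmissible"
)

// AttemptCounter is a stdlib-only stand-in for a counter with a "result" label.
// It is not safe for concurrent use; a real metrics library handles that.
type AttemptCounter struct {
	counts map[AttemptResult]int
}

// NewAttemptCounter returns an empty counter.
func NewAttemptCounter() *AttemptCounter {
	return &AttemptCounter{counts: make(map[AttemptResult]int)}
}

// Observe records exactly one attempt. The number of admitted workloads only
// decides the label; it never changes how much the counter grows.
func (c *AttemptCounter) Observe(admittedWorkloads int) AttemptResult {
	result := ResultInadmissible
	if admittedWorkloads > 0 {
		result = ResultSuccess
	}
	c.counts[result]++
	return result
}

// Count returns how many attempts ended with the given result.
func (c *AttemptCounter) Count(r AttemptResult) int {
	return c.counts[r]
}

func main() {
	c := NewAttemptCounter()
	// Three attempts admitting 3, 0 and 1 workloads respectively.
	for _, admitted := range []int{3, 0, 1} {
		c.Observe(admitted)
	}
	fmt.Println(c.Count(ResultSuccess), c.Count(ResultInadmissible)) // prints "2 1"
}
```

Counting attempts rather than workloads keeps the metric usable as a scheduler throughput signal; per-workload numbers would come from the separate gauge suggested above.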

prometheus.HistogramOpts{
Subsystem: subsystemName,
Name: "admission_attempt_duration_seconds",
Help: "Latency of an admission attempt",
Contributor

Suggested change
- Help: "Latency of an admission attempt",
+ Help: "Latency of an admission attempt broken down by result",

Contributor Author

Done

Change-Id: If6f34c08cc4f2d222815ecb257bd37ef5b4ae969
@ahg-g
Contributor

ahg-g commented Apr 28, 2022

/label tide/merge-method-squash

@k8s-ci-robot k8s-ci-robot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Apr 28, 2022
@ahg-g
Contributor

ahg-g commented Apr 28, 2022

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 28, 2022
@ahg-g
Contributor

ahg-g commented Apr 29, 2022

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 29, 2022
@k8s-ci-robot k8s-ci-robot merged commit 7cb7d69 into kubernetes-sigs:main Apr 29, 2022
kind: Role
metadata:
  name: prometheus-k8s
  namespace: system
Contributor

Why do we create the Role in the system namespace? What is the system namespace for?

Contributor

Well, this is where kustomize comes in; refer to:

# Value of this field is prepended to the
# names of all resources, e.g. a deployment named
# "wordpress" becomes "alices-wordpress".
# Note that it should also match with the prefix (text before '-') of the namespace
# field above.
namePrefix: kueue-

Contributor

Got it. Thanks

ahg-g pushed a commit to ahg-g/kueue that referenced this pull request Aug 10, 2022
…s#233)

* Add metrics for admission attempts count and duration

Add role and rolebinding for prometheus to be able to list the services in the kueue-system namespace.

Change-Id: I77cf51536ebf53ece9a4ba2d8457dbc3e71d1e8d

* Improve help messages

Change-Id: If6f34c08cc4f2d222815ecb257bd37ef5b4ae969
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files.
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
kind/feature Categorizes issue or PR as related to a new feature.
lgtm "Looks good to me", indicates that a PR is ready to be merged.
size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.
5 participants