Add metrics for admission attempts count and duration #233
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: alculquicondor
@alculquicondor: GitHub didn't allow me to assign the following users: kerthcet. Note that only kubernetes-sigs members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
/hold
Add role and rolebinding for prometheus to be able to list the services in the kueue-system namespace. Change-Id: I77cf51536ebf53ece9a4ba2d8457dbc3e71d1e8d
@@ -20,6 +20,3 @@ spec:
- containerPort: 8443
protocol: TCP
name: https
- name: manager
why did you decide to remove this? :)
It initially had flags related to metrics, but they are all in the component config now. There is no point in keeping it in the patch.
pkg/metrics/metrics.go
prometheus.CounterOpts{
	Subsystem: subsystemName,
	Name:      "admission_attempts_total",
	Help:      "Number of attempts to admit pods, by result. `success` means that at least one workload was admitted, `inadmissible` means that no workload was admitted.",
s/pods/workloads
"at least one": hmm, I don't think this will be easy to understand; we need to clarify that each attempt could admit more than one workload.
Clarified.
Unless we should increase the counter by the number of workloads?
Although that's not very useful if you want to measure the throughput of the scheduler.
sg, we should have a separate gauge metric tracking the number of workloads broken down by pending vs admitted and by ClusterQueue
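To make the counting semantics discussed in this thread concrete, here is a minimal sketch in plain Go. It is not the PR's code: a map stands in for the Prometheus counter, and the `attempts` type and `recordAttempt` method are hypothetical names chosen for illustration. It shows that the counter advances once per attempt, no matter how many workloads that attempt admits.

```go
package main

import "fmt"

// attempts is a stand-in for a counter keyed by a "result" label.
// Hypothetical type for illustration; the PR itself uses a Prometheus counter.
type attempts map[string]int

// recordAttempt counts one admission attempt. The result is "success" when
// the attempt admitted at least one workload and "inadmissible" otherwise.
// The count grows by one per attempt, not by the number of workloads.
func (a attempts) recordAttempt(admittedWorkloads int) string {
	result := "inadmissible"
	if admittedWorkloads > 0 {
		result = "success"
	}
	a[result]++
	return result
}

func main() {
	a := attempts{}
	a.recordAttempt(3) // one attempt that admits three workloads
	a.recordAttempt(0) // one attempt that admits nothing
	a.recordAttempt(1) // one attempt that admits a single workload
	fmt.Println(a["success"], a["inadmissible"]) // prints "2 1"
}
```

Counting attempts rather than workloads keeps the metric usable for scheduler throughput, which is the trade-off raised above; the number of workloads would then come from a separate metric.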
pkg/metrics/metrics.go
prometheus.HistogramOpts{
	Subsystem: subsystemName,
	Name:      "admission_attempt_duration_seconds",
	Help:      "Latency of an admission attempt",
- Help: "Latency of an admission attempt",
+ Help: "Latency of an admission attempt broken down by result",
Done
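As a rough illustration of what "broken down by result" means for the duration metric, the sketch below times an attempt and files the latency under the result it returned. It uses only the standard library: the `durations` type and `observeAttempt` method are hypothetical, and a map of slices stands in for a histogram with a `result` label.

```go
package main

import (
	"fmt"
	"time"
)

// durations is a stand-in for a histogram keyed by a "result" label.
// Hypothetical type for illustration; it is not the PR's implementation.
type durations map[string][]time.Duration

// observeAttempt runs one admission attempt, measures how long it took,
// and records the latency under the result the attempt returned.
func (d durations) observeAttempt(attempt func() string) string {
	start := time.Now()
	result := attempt()
	d[result] = append(d[result], time.Since(start))
	return result
}

func main() {
	d := durations{}
	d.observeAttempt(func() string { return "success" })
	d.observeAttempt(func() string { return "inadmissible" })
	d.observeAttempt(func() string { return "inadmissible" })
	fmt.Println(len(d["success"]), len(d["inadmissible"])) // prints "1 2"
}
```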
Change-Id: If6f34c08cc4f2d222815ecb257bd37ef5b4ae969
/label tide/merge-method-squash
/lgtm
/hold cancel
kind: Role
metadata:
  name: prometheus-k8s
  namespace: system
Why do we create the Role in the system namespace? What is the system namespace for?
Well, this is how kustomize works; refer to kueue/config/default/kustomization.yaml, lines 4 to 9 in a4703d3:
# Value of this field is prepended to the
# names of all resources, e.g. a deployment named
# "wordpress" becomes "alices-wordpress".
# Note that it should also match with the prefix (text before '-') of the namespace
# field above.
namePrefix: kueue-
Got it. Thanks
…s#233) * Add metrics for admission attempts count and duration Add role and rolebinding for prometheus to be able to list the services in the kueue-system namespace. Change-Id: I77cf51536ebf53ece9a4ba2d8457dbc3e71d1e8d * Improve help messages Change-Id: If6f34c08cc4f2d222815ecb257bd37ef5b4ae969
What type of PR is this?
/kind feature
What this PR does / why we need it:
Add metrics for admission attempts count and duration, needed to track that the scheduler is doing work.
Add role and rolebinding for prometheus to be able to list the services in the kueue-system namespace.
Which issue(s) this PR fixes:
Part of #199 (4)
Special notes for your reviewer:
This doesn't add the Prometheus ServiceMonitor, Role or RoleBinding by default, as users might have their own monitoring system, or they might not want to set up Prometheus at all.
In a follow-up, I'll document how users can set up Prometheus.