Add metric for pending workloads, broken down by queue and cluster_queue #237

Merged
merged 1 commit into from
May 3, 2022

Conversation

alculquicondor
Contributor

What type of PR is this?

/kind feature

What this PR does / why we need it:

Add the pending_workloads metric. It is updated only when a queue changes. The total number of pending workloads for a cluster_queue can be obtained by aggregating over its queues.
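
As a rough illustration of the aggregation mentioned above, the sketch below models each per-queue gauge series as a plain struct and sums the values per cluster_queue. The `sample` type, the `sumByClusterQueue` helper, and the queue names are hypothetical; they stand in for what a metrics backend would do at query time and are not taken from this PR's code.

```go
package main

import "fmt"

// sample is a hypothetical stand-in for one pending_workloads gauge series.
type sample struct {
	ClusterQueue string
	Queue        string
	Pending      int
}

// sumByClusterQueue adds up the per-queue values for each cluster_queue.
func sumByClusterQueue(samples []sample) map[string]int {
	totals := make(map[string]int)
	for _, s := range samples {
		totals[s.ClusterQueue] += s.Pending
	}
	return totals
}

func main() {
	samples := []sample{
		{ClusterQueue: "cq-a", Queue: "team-1", Pending: 3},
		{ClusterQueue: "cq-a", Queue: "team-2", Pending: 2},
		{ClusterQueue: "cq-b", Queue: "team-3", Pending: 4},
	}
	totals := sumByClusterQueue(samples)
	fmt.Println(totals["cq-a"], totals["cq-b"])
}
```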

Which issue(s) this PR fixes:

Part of #199

Special notes for your reviewer:

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 29, 2022
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 29, 2022
@alculquicondor alculquicondor changed the title WIP Add metric for pending workloads, broken down by queue and cluster_queue Add metric for pending workloads, broken down by queue and cluster_queue Apr 29, 2022
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 29, 2022
@alculquicondor
Contributor Author

/hold
to prevent premature merge

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 29, 2022
@alculquicondor alculquicondor force-pushed the queue-metrics branch 2 times, most recently from e0411e1 to fba2b13 Compare April 29, 2022 19:54
@alculquicondor
Contributor Author

/assign @kerthcet @ahg-g

@k8s-ci-robot
Contributor

@alculquicondor: GitHub didn't allow me to assign the following users: kerthcet.

Note that only kubernetes-sigs members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign @kerthcet @ahg-g

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

		Subsystem: subsystemName,
		Name:      "pending_workloads",
		Help:      "Number of pending workloads, per queue and cluster_queue.",
	}, []string{"queue", "cluster_queue"})
Contributor

nit: place cluster_queue first because the relation is cq -- 1:n -- q

Contributor Author

Done

pkg/queue/manager.go (resolved)
pkg/queue/manager.go (resolved)
@ahg-g
Contributor

ahg-g commented Apr 29, 2022

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 29, 2022
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 29, 2022
@alculquicondor
Contributor Author

/hold cancel
Fixed the integration test framework and squashed.

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 29, 2022
pkg/queue/queue.go (resolved)
@@ -199,6 +201,7 @@ func (m *Manager) UpdateQueue(q *kueue.Queue) error {
		}
	}
	qImpl.update(q)
	qImpl.reportPendingWorkloads()
Contributor

How about wrapping this in a defer? The difference is that reportPendingWorkloads() would still run when we return the errQueueDoesNotExist error. (Similar to other places.)

Contributor Author

We don't want to report when the queue doesn't exist.
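
To make the difference discussed in this thread concrete, here is a minimal sketch, not the PR's code. `errQueueDoesNotExist` is the name used in the comment above; `updateExplicit`, `updateDeferred`, and the `reports` counter are hypothetical. The deferred variant still reports on the early error return, while the explicit call does not.

```go
package main

import (
	"errors"
	"fmt"
)

var errQueueDoesNotExist = errors.New("queue doesn't exist")

// updateExplicit reports only after a successful update.
func updateExplicit(exists bool, reports *int) error {
	if !exists {
		return errQueueDoesNotExist
	}
	*reports++ // stands in for qImpl.reportPendingWorkloads()
	return nil
}

// updateDeferred reports on every return path, including the error path.
func updateDeferred(exists bool, reports *int) error {
	defer func() { *reports++ }()
	if !exists {
		return errQueueDoesNotExist
	}
	return nil
}

func main() {
	var explicit, deferred int
	_ = updateExplicit(false, &explicit)
	_ = updateDeferred(false, &deferred)
	fmt.Println(explicit, deferred)
}
```

With a missing queue, the explicit variant leaves its counter untouched and the deferred variant increments it, which is the behavior the author wanted to avoid.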

	metrics.Registry.MustRegister(
		admissionAttempts,
		admissionAttemptLatency,
		PendingWorkloads,
Contributor

What about lowercasing PendingWorkloads to pendingWorkloads, moving reportPendingWorkloads into this package, and passing in the necessary parameters?

Contributor Author

The necessary parameters are the cluster_queue and queue keys, which are the same parameters that WithValues requires. Then we would still need a reportPendingWorkloads to get a key for the queue, so we would need two functions. I think this is clean enough.

@@ -315,6 +321,7 @@ func (m *Manager) RequeueWorkload(ctx context.Context, info *workload.Info, imme

func (m *Manager) DeleteWorkload(w *kueue.Workload) {
	m.Lock()
	defer m.Unlock()
	m.deleteWorkloadFromQueueAndClusterQueue(w, queueKeyForWorkload(w))
	m.Unlock()
Member

Remove the m.Unlock() on L326.

Contributor Author

Oops, I'll just revert this.
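
For reference, a minimal sketch of the locking pattern under discussion, using a hypothetical `manager` type rather than the PR's `Manager`. Once `defer m.Unlock()` is in place, an explicit `m.Unlock()` at the end of the function would unlock the mutex a second time, which the Go runtime treats as a fatal error. The sketch keeps only the deferred unlock.

```go
package main

import (
	"fmt"
	"sync"
)

// manager is a hypothetical stand-in for a type guarded by a mutex.
type manager struct {
	sync.Mutex
	workloads map[string]struct{}
}

// deleteWorkload holds the lock for the whole call and releases it exactly
// once, through the deferred Unlock.
func (m *manager) deleteWorkload(name string) {
	m.Lock()
	defer m.Unlock()
	delete(m.workloads, name)
}

func main() {
	m := &manager{workloads: map[string]struct{}{"a": {}, "b": {}}}
	m.deleteWorkload("a")
	m.deleteWorkload("b") // would block forever if the first call had leaked the lock
	fmt.Println(len(m.workloads))
}
```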

pkg/queue/queue.go (resolved)
@denkensk
Member

/retest-required

@kerthcet
Contributor

kerthcet commented May 1, 2022

Another solution for reference: counting the pending workloads in ClusterQueueImpl seems more stable. When a workload is added, call metrics.GaugeMetric.Inc(); when one is deleted, call metrics.GaugeMetric.Dec(). E.g.
https://github.com/kerthcet/kueue/blob/9d1d1da21c139a9b6cf188f9ebe58506a12f8a67/pkg/queue/cluster_queue_impl.go#L80-L103

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 1, 2022
Change-Id: I46a509a2612597004483cb8c90bb76dcfe95742d
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 2, 2022
Contributor Author

@alculquicondor alculquicondor left a comment

Rebased

Another solution for reference, counting the pending workloads in ClusterQueueImpl, it seems more stable

What makes you say that it would be more stable?

The source of truth for how many elements are in the queue is the queue implementation. Are you thinking of race conditions where two goroutines try to add or remove elements at the same time? That's not a problem because the operations take a lock.
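
The point about the queue being the source of truth can be sketched as follows. This is an illustrative model only: `pendingQueue`, `report`, and the `reported` field (a stand-in for a gauge) are hypothetical and not the PR's types. Every mutation takes the lock and then sets the gauge from the current length, so the reported value stays equal to the queue size even when an add or remove turns out to be a no-op.

```go
package main

import (
	"fmt"
	"sync"
)

// pendingQueue is a hypothetical queue whose size is mirrored into a gauge.
type pendingQueue struct {
	mu       sync.Mutex
	items    map[string]struct{}
	reported int // stand-in for a gauge value
}

func newPendingQueue() *pendingQueue {
	return &pendingQueue{items: make(map[string]struct{})}
}

// report sets the gauge from the queue's own length. Callers must hold mu.
func (q *pendingQueue) report() { q.reported = len(q.items) }

func (q *pendingQueue) add(name string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.items[name] = struct{}{}
	q.report()
}

func (q *pendingQueue) remove(name string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	delete(q.items, name)
	q.report()
}

func main() {
	q := newPendingQueue()
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			q.add(fmt.Sprintf("w%d", i))
		}(i)
	}
	wg.Wait()
	q.add("w0")    // duplicate add: the length, and so the gauge, is unchanged
	q.remove("w1") // removes one element
	q.remove("w1") // duplicate remove is a no-op
	fmt.Println(q.reported == len(q.items))
}
```

An Inc/Dec approach would need each call site to know whether the add or remove actually changed the queue; setting the value from the length avoids that bookkeeping.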

@ahg-g
Copy link
Contributor

ahg-g commented May 3, 2022

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 3, 2022
@k8s-ci-robot k8s-ci-robot merged commit c8f277e into kubernetes-sigs:main May 3, 2022
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
5 participants