Add metric for pending workloads, broken down by queue and cluster_queue #237

alculquicondor · 2022-04-29T16:09:43Z

What type of PR is this?

/kind feature

What this PR does / why we need it:

Adds the pending_workloads metric, broken down by queue and cluster_queue. The metric is only updated when a queue changes. The total number of pending workloads for a cluster_queue can be obtained by aggregating over its queues.
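The aggregation mentioned above can be pictured with a small stand-in for a labeled gauge. This is only a sketch: `pendingGauge`, `gaugeKey`, `set`, and `totalFor` are hypothetical names for illustration and are not part of this PR or of the Prometheus client. In practice the sum would be computed by the monitoring system over the exported series.

```go
package main

import "fmt"

// gaugeKey mirrors the two labels of the pending_workloads metric.
type gaugeKey struct {
	clusterQueue string
	queue        string
}

// pendingGauge is a hypothetical stand-in for a labeled gauge.
type pendingGauge map[gaugeKey]int

// set records the current number of pending workloads for one queue.
func (g pendingGauge) set(clusterQueue, queue string, n int) {
	g[gaugeKey{clusterQueue, queue}] = n
}

// totalFor sums the per-queue values that share a cluster_queue label.
func (g pendingGauge) totalFor(clusterQueue string) int {
	total := 0
	for k, v := range g {
		if k.clusterQueue == clusterQueue {
			total += v
		}
	}
	return total
}

func main() {
	g := pendingGauge{}
	g.set("cq-a", "team-1", 3)
	g.set("cq-a", "team-2", 2)
	g.set("cq-b", "team-3", 7)
	fmt.Println(g.totalFor("cq-a")) // prints 5
}
```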

Which issue(s) this PR fixes:

Part of #199

Special notes for your reviewer:

k8s-ci-robot · 2022-04-29T16:09:53Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [alculquicondor]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

alculquicondor · 2022-04-29T19:52:05Z

/hold
to prevent premature merge

alculquicondor · 2022-04-29T19:55:07Z

/assign @kerthcet @ahg-g

k8s-ci-robot · 2022-04-29T19:55:09Z

@alculquicondor: GitHub didn't allow me to assign the following users: kerthcet.

Note that only kubernetes-sigs members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign @kerthcet @ahg-g

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

ahg-g · 2022-04-29T20:50:02Z

pkg/metrics/metrics.go

+			Subsystem: subsystemName,
+			Name:      "pending_workloads",
+			Help:      "Number of pending workloads, per queue and cluster_queue.",
+		}, []string{"queue", "cluster_queue"})


nit: place cluster_queue first, because the relation is one-to-many: cq -- 1:n -- q

ahg-g · 2022-04-29T21:49:36Z

/lgtm

alculquicondor · 2022-04-29T22:00:28Z

/hold cancel
Fixed the integration test framework and squashed.

kerthcet · 2022-04-30T12:27:25Z

pkg/queue/manager.go

@@ -199,6 +201,7 @@ func (m *Manager) UpdateQueue(q *kueue.Queue) error {
 		}
 	}
 	qImpl.update(q)
+	qImpl.reportPendingWorkloads()


How about wrapping this in a defer? The difference is that reportPendingWorkloads() would still run when we return the errQueueDoesNotExist error. (Similar to other places.)

kerthcet · 2022-04-30T12:35:44Z

pkg/metrics/metrics.go

 	metrics.Registry.MustRegister(
 		admissionAttempts,
 		admissionAttemptLatency,
+		PendingWorkloads,


What about lowercasing PendingWorkloads to pendingWorkloads, moving reportPendingWorkloads into this package, and passing in the necessary parameters?

denkensk · 2022-04-30T14:55:44Z

pkg/queue/manager.go

@@ -315,6 +321,7 @@ func (m *Manager) RequeueWorkload(ctx context.Context, info *workload.Info, imme

 func (m *Manager) DeleteWorkload(w *kueue.Workload) {
 	m.Lock()
+	defer m.Unlock()
 	m.deleteWorkloadFromQueueAndClusterQueue(w, queueKeyForWorkload(w))
 	m.Unlock()


Remove the explicit m.Unlock() on L326. With the deferred unlock added, the mutex would be unlocked twice.

denkensk · 2022-04-30T15:04:49Z

/retest-required

kerthcet · 2022-05-01T11:04:29Z

Another option, for reference: count the pending workloads in ClusterQueueImpl, which seems more stable. Call metrics.GaugeMetric.Inc() when a workload is added and metrics.GaugeMetric.Dec() when one is deleted. For example:
https://github.com/kerthcet/kueue/blob/9d1d1da21c139a9b6cf188f9ebe58506a12f8a67/pkg/queue/cluster_queue_impl.go#L80-L103

alculquicondor

Rebased

> Another solution for reference, counting the pending workloads in ClusterQueueImpl, it seems more stable

What makes you say that it would be more stable?

The source of truth for how many elements are in the queue is the queue implementation. Are you thinking of race conditions where two goroutines try to add or remove elements at the same time? That's not a problem, because the operations take a lock.
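To make the trade-off concrete, here is a minimal sketch comparing a gauge set from the queue's own size with a counter bumped at every call. It is not the code from this PR or from the linked branch: `queue`, `addOrUpdate`, `remove`, `counter`, and `reported` are hypothetical names, and the Inc/Dec side is deliberately naive. A careful Inc/Dec implementation that checks for existence first would not drift this way.

```go
package main

import "fmt"

// queue is a hypothetical keyed set of pending workloads.
type queue struct {
	items    map[string]struct{}
	counter  int // maintained with a naive Inc/Dec at every call
	reported int // set from the queue's own size after every change
}

// addOrUpdate inserts a key; adding an existing key is just an update.
func (q *queue) addOrUpdate(key string) {
	q.items[key] = struct{}{}
	q.counter++
	q.reported = len(q.items)
}

// remove deletes a key; deleting a missing key is a no-op for the set.
func (q *queue) remove(key string) {
	delete(q.items, key)
	q.counter--
	q.reported = len(q.items)
}

func main() {
	q := &queue{items: map[string]struct{}{}}
	q.addOrUpdate("ns/w1")
	q.addOrUpdate("ns/w1") // same workload again: an update, not a new entry
	fmt.Println(q.counter, q.reported) // prints 2 1
}
```

Setting the value from the container's length keeps the reported number tied to the source of truth, at the cost of a report call after each mutation.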

alculquicondor · 2022-05-02T17:46:03Z

pkg/metrics/metrics.go

 	metrics.Registry.MustRegister(
 		admissionAttempts,
 		admissionAttemptLatency,
+		PendingWorkloads,


The necessary parameters are the cluster_queue and queue keys, which are the same parameters that WithLabelValues requires. Then we would still need a reportPendingWorkloads to get a key for the queue, so we would need two functions. I think this is clean enough.

alculquicondor · 2022-05-02T17:48:09Z

pkg/queue/manager.go

@@ -199,6 +201,7 @@ func (m *Manager) UpdateQueue(q *kueue.Queue) error {
 		}
 	}
 	qImpl.update(q)
+	qImpl.reportPendingWorkloads()


We don't want to report when the queue doesn't exist.

alculquicondor · 2022-05-02T17:49:09Z

pkg/queue/manager.go

@@ -315,6 +321,7 @@ func (m *Manager) RequeueWorkload(ctx context.Context, info *workload.Info, imme

 func (m *Manager) DeleteWorkload(w *kueue.Workload) {
 	m.Lock()
+	defer m.Unlock()
 	m.deleteWorkloadFromQueueAndClusterQueue(w, queueKeyForWorkload(w))
 	m.Unlock()


Oops, I'll just revert this.
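For context on why the reviewed diff needed a change: it kept the explicit unlock while adding a deferred one, and unlocking a sync.Mutex that is not locked is a fatal run-time error in Go. Below is a minimal sketch of the defer-only form, with a hypothetical `manager` type and `deleteWorkload` method that are not the PR's code; reverting to the single explicit unlock is equally valid.

```go
package main

import (
	"fmt"
	"sync"
)

// manager is a hypothetical stand-in guarded by an embedded mutex.
type manager struct {
	sync.Mutex
	workloads map[string]struct{}
}

// deleteWorkload holds the lock for the whole call and releases it exactly once.
func (m *manager) deleteWorkload(key string) {
	m.Lock()
	defer m.Unlock() // the only unlock; no explicit m.Unlock() below
	delete(m.workloads, key)
}

func main() {
	m := &manager{workloads: map[string]struct{}{"ns/w1": {}}}
	m.deleteWorkload("ns/w1")
	// TryLock (Go 1.18+) succeeds only if the deferred unlock released the mutex.
	fmt.Println(len(m.workloads), m.TryLock()) // prints 0 true
}
```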

ahg-g · 2022-05-03T20:25:55Z

/lgtm

k8s-ci-robot added the do-not-merge/work-in-progress, kind/feature, and cncf-cla: yes labels Apr 29, 2022

k8s-ci-robot requested review from ArangoGutierrez and denkensk April 29, 2022 16:09

k8s-ci-robot added the approved and size/M labels Apr 29, 2022

alculquicondor force-pushed the queue-metrics branch from ce04fc5 to 04e0e93 Compare April 29, 2022 19:51

alculquicondor changed the title ~~WIP Add metric for pending workloads, broken down by queue and cluster_queue~~ Add metric for pending workloads, broken down by queue and cluster_queue Apr 29, 2022

k8s-ci-robot removed the do-not-merge/work-in-progress label Apr 29, 2022

k8s-ci-robot added the do-not-merge/hold label Apr 29, 2022

alculquicondor force-pushed the queue-metrics branch 2 times, most recently from e0411e1 to fba2b13 Compare April 29, 2022 19:54

k8s-ci-robot assigned ahg-g Apr 29, 2022

alculquicondor force-pushed the queue-metrics branch from fba2b13 to ed11d11 Compare April 29, 2022 19:57

ahg-g reviewed Apr 29, 2022

k8s-ci-robot added the lgtm label Apr 29, 2022

alculquicondor force-pushed the queue-metrics branch from fcbbf4f to 703b95f Compare April 29, 2022 22:00

k8s-ci-robot removed the lgtm label Apr 29, 2022

k8s-ci-robot removed the do-not-merge/hold label Apr 29, 2022

kerthcet reviewed Apr 30, 2022

denkensk reviewed Apr 30, 2022

k8s-ci-robot added the needs-rebase label May 1, 2022

Add metric for pending workloads, broken down by queue and cluster_queue

0c4cf75

Change-Id: I46a509a2612597004483cb8c90bb76dcfe95742d

alculquicondor force-pushed the queue-metrics branch from 703b95f to 0c4cf75 Compare May 2, 2022 18:49

k8s-ci-robot removed the needs-rebase label May 2, 2022

alculquicondor commented May 2, 2022

k8s-ci-robot added the lgtm label May 3, 2022

k8s-ci-robot merged commit c8f277e into kubernetes-sigs:main May 3, 2022
