KEP: Pending workloads visibility [expose in queue statuses] #991
Conversation
✅ Deploy Preview for kubernetes-sigs-kueue canceled.
cc @trasc
Force-pushed from 53d26d1 to ab1387a
type PendingWorkloadsVisibility struct {
	// MaxPendingWorkloadsInStatus indicates the maximal number of pending
	// workloads for which their local queue order is exposed.
	// Defaults to 100.
Maybe default to 0, so the feature is disabled by default.
Why would we set it to 0? I think the feature is quite valuable, so we should make it enabled by default. Still, we may consider going through an alpha API to give it some time for testing.
IMO, if the performance is a concern, then we should make sure that the penalty is negligible, at least for sufficiently small values of N, but greater than 0 :).
	// MaxPendingWorkloadsInStatus indicates the maximal number of pending
	// workloads for which their local queue order is exposed.
	// Defaults to 100.
	MaxTopPendingWorkloads *int32
Should we limit this to avoid the LocalQueue's object size exceeding 1.5MiB in etcd?
Maybe we should mention the object's size in the Risks and Mitigations section.
The limit in etcd is 1.5Mi; added a point.
Right. My comment had a typo...
I fixed my comment.
Right, I think it makes sense not to put a hard constraint. This is because the constraint depends on the lengths of the names and namespaces of workloads. If we follow the 253-character limit of Kubernetes, we may end up with a max of 2000, which would be very constraining. Especially since many organizations may have internal policies to limit the number of characters for namespaces, for example, as they derive the namespace names from the names of internal teams, etc.
I suggest just expanding the user-facing documentation to say that going beyond 2000 might be problematic and should be thought through. Added a note in the Risks and Mitigations section.
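For a rough sense of scale, here is a back-of-envelope sketch (the per-entry JSON overhead is an assumed number, not something taken from the KEP) of how many worst-case entries fit under the etcd object size limit:

```go
package main

import "fmt"

const (
	maxNameLen       = 253         // Kubernetes max object name length
	maxNamespaceLen  = 253         // Kubernetes max namespace name length
	perEntryOverhead = 50          // assumed bytes for JSON keys and separators
	etcdObjectLimit  = 1536 * 1024 // ~1.5MiB etcd object size limit
)

func main() {
	perEntry := maxNameLen + maxNamespaceLen + perEntryOverhead
	// Prints roughly 2800; extra per-entry fields (positions, timestamps)
	// push the practical ceiling down towards the ~2000 mentioned above.
	fmt.Println(etcdObjectLimit / perEntry)
}
```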
Force-pushed from ab1387a to 673d463
### Goals

- expose the order of workloads in the LocalQueue and ClusterQueue
Suggested change: "expose the order of workloads in the LocalQueue and ClusterQueue" → "expose the order of workloads in the LocalQueue"
Exposing the clusterqueue will allow the admin to have full visibility. It was suggested that the clusterqueue should be the higher priority work (#991 (comment)). Let me leave both for now.
First, it is important to note that the state of the LocalQueues might be inconsistent at a given moment. For example, two different workloads may say they have position 1, due to the throttling of the updates.
Now, other approaches may exist to expose the information for admins; however, with the problems of throttling and object-size limiting solved, the solution should be easy to transfer. It seems much easier than building an alternative approach dedicated to admins. Adding a note in Alternatives.
The proposal is to extend the APIs for the status of LocalQueue and ClusterQueue
to expose the order of workloads. The order will only be exposed up to some
configurable depth.
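For readers following along, here is a minimal sketch of what such a status extension could look like, assembled only from the fields quoted in this review thread (TopList, LastChangeTime, Position); the exact type and field names in the final KEP may differ.

```go
package v1beta1 // hypothetical package, for illustration only

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// LocalQueuePendingWorkload describes one pending workload; field names
// follow the snippets quoted in this review and are not authoritative.
type LocalQueuePendingWorkload struct {
	// Name of the pending Workload.
	Name string `json:"name"`
	// Position of the workload among all pending workloads in the cluster queue.
	Position *int32 `json:"position,omitempty"`
}

// LocalQueuePendingWorkloadsStatus exposes the top pending workloads up to a
// configurable depth, plus the time of the last change.
type LocalQueuePendingWorkloadsStatus struct {
	// TopList contains the first N pending workloads, in queue order.
	TopList []LocalQueuePendingWorkload `json:"topList,omitempty"`
	// LastChangeTime indicates the time of the last change of the structure.
	LastChangeTime metav1.Time `json:"lastChangeTime,omitempty"`
}
```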
Should we start with a fixed quantity? The limitation is on the size of the object. Why would users want less?
Instead, we could make the rate at which the summary is updated configurable.
> Should we start with a fixed quantity? The limitation is on the size of the object.

Good point, I think I like it, will update accordingly.

> Why would users want less?

Exposing positional information may trigger many updates; even if batched, there would be a request every 5s. With a low number, a user can reduce the rate of actual updates even more.
Another reason why a very large threshold might be problematic, even with red-black trees (unless there are versions allowing thread-safe iteration): if we iterate for long, we need to take a read lock for the duration of the iteration. This can be problematic because it could mean lower throughput for update operations. One locked iteration per 5s should be ok, but allowing the depth to be limited might be desirable too.
Also, on second thought, cutting by the object size might be tricky. In particular, how would we estimate the size of the actual object without serializing it? We could manually add up the sizes of individual fields without serializing, but that might be hard to maintain. Also, if the procedure of checking the object size is computationally expensive, it will make the iteration longer if checked after every added item.
	// Position indicates the position of the workload among all pending
	// workloads in the cluster queue.
	Position *int32
We should call this approximatePosition, for two reasons:
- by the time the position is posted, it's probably out of date.
- it gives us the chance to use coarse numbers: 1, 2, 3, ..., 10, 20, 30, ..., 100, 200, 300, ..., 1000. This might potentially reduce the need for updates.
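As an illustration of the coarse-numbers idea (the function below is made up for this comment, not part of the KEP), a position can be rounded down to one significant digit so small movements in the queue don't force status updates:

```go
package sketch

// coarsePosition rounds a position down to one significant digit, producing
// the sequence 1, 2, ..., 10, 20, ..., 100, 200, ..., 1000 suggested above.
// For example, coarsePosition(57) == 50 and coarsePosition(234) == 200.
func coarsePosition(pos int32) int32 {
	if pos < 10 {
		return pos
	}
	magnitude := int32(1)
	for pos >= magnitude*10 {
		magnitude *= 10
	}
	return (pos / magnitude) * magnitude
}
```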
Why do we actually need this? Without access to the names of the other workloads, the position is not particularly useful. The user will not know whether there is 1 huge job or 100 tiny ones ahead of their task in the queue. At the same time, Kueue will be using a lot of API server write quota to maintain this information.
> We should call this approximatePosition, for two reasons:
> - by the time the position is posted, it's probably out of date.

If we batch updates and have a 5s interval, it does not seem relevant enough to name it approximate. Similarly, we call the counter failed in the Job controller despite some potential delay.

> - it gives us the chance to use coarse numbers: 1, 2, 3, ..., 10, 20, 30, ..., 100, 200, 300, ..., 1000. This might potentially reduce the need for updates.

I think for coarse positions we should probably mention a range rather than specific positions.
Still, what is the point of this number? With a large number of LocalQueues pointing to the same ClusterQueue, the bigger part of the write quota will be used for these updates, even with 5s batching.
> Still, what is the point of this number?

The point is to give the user some feedback about their position in the cluster queue (which the user does not see). This is analogous to hot-line bots saying "your position is 5". As a caller you also don't know how long it will take for the other callers to finish; however, you can get some estimate assuming 10min per caller. Similarly, in the context of Kueue, you can get a rough estimate of how long it takes to process 5 items based on the historical velocity of the processing.
I think, ultimately, along with the position we would like to have an ETA. Getting an ETA based on execution time per job might be involved (it requires collecting historical data per job). Still, we could have a rough ETA based on the overall cluster queue velocity.
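To make the "rough ETA from queue velocity" idea concrete, here is a sketch under the assumption that an admission velocity per ClusterQueue is tracked somewhere; the function and its inputs are hypothetical, not proposed API:

```go
package sketch

import "time"

// estimateETA derives a rough waiting time from a pending workload's position
// and an observed admission velocity (workloads admitted per minute). Both
// the inputs and the formula are assumptions for the sake of this discussion.
func estimateETA(position int32, admittedPerMinute float64) (time.Duration, bool) {
	if admittedPerMinute <= 0 || position < 0 {
		return 0, false // no measurable velocity yet, so no estimate
	}
	minutes := float64(position) / admittedPerMinute
	return time.Duration(minutes * float64(time.Minute)), true
}
```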
Honestly, I'm still not convinced (the benefits vs. write cost are not clear), but maybe I'm wrong. What about adding it as a possible phase 2 feature?
This is a good point; returning positions adds a complication to the implementation to throttle the updates. I think we should indeed start the implementation without positions for LocalQueues + ClusterQueues.
Certainly we should split the implementation into small commits / PRs when it makes sense. As for the KEP, yes, I'm not sure about the phase process. I could also split the KEP into two (a ClusterQueue KEP and a LocalQueue KEP).
However, I think returning the local queue order without even positioning is pointless. Returning the position gives at least some clue, if you can estimate the queue velocity.
> Honestly, I'm still not convinced (the benefits vs. write cost are not clear), but maybe I'm wrong. What about adding it as a possible phase 2 feature?

I have split the implementation into phases now. PTAL.
Force-pushed from 853ac7d to 07a3afb
### Goals

- expose the order of workloads in the LocalQueue and ClusterQueue
Actually, IMHO, we should rather start from ClusterQueue and add LocalQueue a bit later.
I guess so, though some visibility into local queues seems needed sooner or later, so it seems to make sense to discuss both in the same KEP. Also, we can split the implementation into phases.
What is the rationale to start with ClusterQueue, given that it should be hidden from end users? Also, there might be privacy concerns because the list would have information for multiple namespaces.
I think the rationale for starting with ClusterQueue is that it is simpler and has a lower potential perf impact.
I've updated the KEP to propose 2 graduation stages for the feature: ClusterQueue for Beta, and LocalQueue in the next phase for Stable. Or we may have 2 stages for Beta. Let me know wdyt.
	TopList []LocalQueuePendingWorkload

	// LastChangeTime indicates the time of the last change of the structure.
	LastChangeTime metav1.Time
When is this field updated? Every 5 sec, or every time there is a change in TopList (to indicate that the list is still current)? If the second, what is understood as a change?
Every time there is a change, but changes would be throttled so that there is at most one change per 5s. A change means an actual change in the object (it could be a new workload added / deleted, or a changed position).
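A minimal sketch of that throttling (the names and structure are made up, not the actual Kueue code): changes only mark the status dirty, and at most one flush is sent per interval.

```go
package sketch

import "time"

// statusThrottler batches status changes so that at most one update is sent
// per updateInterval (e.g. 5s).
type statusThrottler struct {
	updateInterval time.Duration
	lastUpdate     time.Time
	dirty          bool
}

// markDirty records that the in-memory structure changed.
func (t *statusThrottler) markDirty() { t.dirty = true }

// maybeFlush sends the pending change only if something changed and the
// interval since the last update has elapsed.
func (t *statusThrottler) maybeFlush(now time.Time, send func() error) error {
	if !t.dirty || now.Sub(t.lastUpdate) < t.updateInterval {
		return nil
	}
	if err := send(); err != nil {
		return err
	}
	t.lastUpdate = now
	t.dirty = false
	return nil
}
```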
Suggested change: LastChangeTime metav1.Time → LastChangeTimestamp metav1.Time
I have a slight preference towards just time, because:
- it is shorter
- it can be updated.
It seems that in k8s, time is used in the condition fields (lastTransitionTime and lastProbeTime), which can be updated, whereas timestamp is used for deletionTimestamp and creationTimestamp, which cannot be updated later. Not sure this is a deliberate k8s design, but it is something that occurred to me.
Force-pushed from c434e96 to 4b6db68
should assume not more than 500Ki is used.

In order to mitigate this risk we put a constraint on the number of exposed
pending workloads. We limit the number to 1000.
For objects that we expect to number in the thousands or more in the cluster (such as LocalQueue may be, if there's one per user in a large organization, or one per experiment), I don't think it's a good idea to aim for nearly maxing out the object size. In such a setup there will likely be a need to list and watch queues, and watching large objects is much more expensive.
I think it would be good to keep the default size of objects much lower than that, even with high churn. If we can parametrize the number of pending workloads and set the default to some low number, that would allow the feature to be both enabled by default and scalable by default. Users can then set a higher value if they prefer.
Yeah, I'm hesitant about how to come up with estimations for the default. The etcd limit is something tangible we can use for estimations relatively easily. The situation with many LocalQueues, many watches on them, and high churn has many variables which are interconnected in odd ways, so it does not give me a good basis for estimations.
Also, I think of two scenarios:
- local queues set up manually by an admin, where there is a handful of them and they are per team
- local queues set up automatically, per user or per experiment
In the second case we can indeed expect many of them, but they will probably not have a high churn all the time. I suppose most of them will be idle if they are created per researcher. If they are created per experiment, their depth will not be large anyway.
I think there is a natural analogy with Jobs, where for the Indexed Jobs and Backoff Limit Per Index features the size of the status can be large, so we have limits for the fields that control it. The approach we took was to set up the limits so that the max object size is roughly 500Ki.
I think the considerations between Jobs and LocalQueues are very similar:
- The object size may grow with new functionalities; for example, we may want to add an ETA field per workload.
- The worst-case scenario is unlikely in practice, since it assumes the lengths of namespaces and names to be at the max of 253 characters.
- Many instances of LocalQueue (or Job) are possible, and they both can be watched.
For reference calculations: https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3850-backoff-limits-per-index-for-indexed-jobs#the-job-object-too-big
Following this strategy, I’m now leaning to set the default at 1000, which translates to roughly 500k.
FWIW, the number of LocalQueues is constrained by the number of namespaces, which is up to 10k: https://github.com/kubernetes/community/blob/master/sig-scalability/configs-and-limits/thresholds.md#kubernetes-thresholds
Another aspect here that makes the LocalQueues different from Jobs is that here we need to take the read lock for the duration of the iteration over the global cluster queue. Thus, setting a lower limit for local queues actually makes sense.
> Another aspect here that makes the LocalQueues different from Jobs is that here we need to take the read lock for the duration of the iteration over the global cluster queue. Thus, setting a lower limit for local queues actually makes sense.

I have updated the proposal to take a periodic in-memory snapshot of the ClusterQueue, in order to avoid taking a prolonged read lock on the ClusterQueue while generating the structure for the LocalQueue.
I've added a section in Design Details on setting the defaults, and updated the sections on risks. @aleksandra-malinowska PTAL. In particular, I'm wondering if there is a way to translate the risk related to watches into some estimations. For now, I propose to set the limit to 1000 (following the Job Backoff Limit Per Index), but the default to 100, to reduce the chance of issues as the feature is enabled by default.
Force-pushed from a2eb59d to f217873
	// MaxCount indicates the maximal number of pending workloads exposed in the
	// local queue status. The maximal value is 4000.
	// Defaults to 10.
	MaxCount int32
What happens if MaxCount is set to 0? Is visibility disabled?
I think it would be useful to temporarily stop updating the positions when the cluster performance is much degraded.
Right, when the value is 0, then we don't send the updates for the corresponding queue type. Further, when both values are 0, the feature is disabled and the snapshot isn't taken.
> I think it would be useful to temporarily stop updating the positions when the cluster performance is much degraded.

Yes, but then Kueue would need to be able to diagnose that the cluster performance is bad. Typically, this is a job for P&F.
Kueue on its side sets a QPS limit of 20/s, but it wouldn't discriminate between updates for positions and other updates. An indirect way of controlling the QPS for status updates is the UpdateInterval parameter.
I think to deal with large scale we may need another solution, such as an on-demand HTTP endpoint via an Extension API Server.
In general updates to ClusterQueue and LocalQueue are lower priority than updates to Workload or the jobs.
We could consider having a separate client for them.
> Right, when the value is 0, then we don't send the updates for the corresponding queue type. Further, when both values are 0, the feature is disabled and the snapshot isn't taken.

Thanks for the clarification :) SGTM

> Yes, but then Kueue would need to be able to diagnose that the cluster performance is bad. Typically, this is a job for P&F.
> Kueue on its side sets a QPS limit of 20/s, but it wouldn't discriminate between updates for positions and other updates. An indirect way of controlling the QPS for status updates is the UpdateInterval parameter.
> I think to deal with large scale we may need another solution, such as an on-demand HTTP endpoint via an Extension API Server.

I agree with you that we should set appropriate parameters for the APF.
> In general updates to ClusterQueue and LocalQueue are lower priority than updates to Workload or the jobs.
> We could consider having a separate client for them.

Oh, it's a great idea!
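A sketch of the separate-client idea, assuming a controller-runtime based setup; the QPS/Burst values below are placeholders, not settings taken from Kueue.

```go
package sketch

import (
	"k8s.io/client-go/rest"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// newQueueStatusClient builds a client dedicated to LocalQueue/ClusterQueue
// status updates, with its own (lower) client-side rate limits so these
// updates cannot crowd out Workload and Job updates.
func newQueueStatusClient() (client.Client, error) {
	cfg := rest.CopyConfig(ctrl.GetConfigOrDie())
	cfg.QPS = 5    // assumed budget for queue-status updates only
	cfg.Burst = 10 // assumed
	return client.New(cfg, client.Options{})
}
```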
along with their positions. Then, the LocalQueue and ClusterQueue controllers
do lookup into the cached structure.

The snapshot is taken periodically in intervals configured in
Can we sync snapshots in stages, not all at once?
In the case of many clusterqueues and localqueues already existing, I guess the kube-apiserver load will explode, and a prolonged read lock will occur immediately after starting or restarting the kueue-controller-manager.
Good point, which was also raised earlier. I now propose to have a queue of snapshot-taking tasks. The queue will be processed by 5 workers. This should allow taking the snapshots in batches. Anyway, I think there is not much risk, as the read lock is only taken for the duration of copying the underlying heap data.
Right, you take the data, but sorting happens outside of the lock. That could be relatively fast if you use copy.
SGTM
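Roughly what the copy-under-lock approach discussed above could look like (the types and field names are stand-ins, not Kueue internals): only the copy happens under the read lock, and sorting happens on the copy.

```go
package sketch

import (
	"sort"
	"sync"
)

// pendingEntry stands in for whatever key the cluster queue heap stores.
type pendingEntry struct {
	Namespace, Name string
	Priority        int32
}

type clusterQueueCache struct {
	mu      sync.RWMutex
	pending []pendingEntry
}

// snapshot copies the heap data under the read lock, then sorts the copy
// outside the lock so update operations are not blocked during sorting.
func (c *clusterQueueCache) snapshot() []pendingEntry {
	c.mu.RLock()
	snap := make([]pendingEntry, len(c.pending))
	copy(snap, c.pending)
	c.mu.RUnlock()

	sort.Slice(snap, func(i, j int) bool {
		return snap[i].Priority > snap[j].Priority // higher priority first
	})
	return snap
}
```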
Force-pushed from 72d8df7 to 9a17870
Force-pushed from 9a17870 to cd55855
Force-pushed from 4abf00b to 0f004ba
Force-pushed from 0f004ba to 1b565f6
@mwielgus @alculquicondor I think I have addressed all the comments. Let me know if there is something more to be done or clarified.
LGTM from my side.
/approve
/hold
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: alculquicondor, mimowo, mwielgus.
/remove-hold
What type of PR is this?
/kind feature
/kind api-change
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #168
Special notes for your reviewer:
Does this PR introduce a user-facing change?