
KEP: Pending workloads visibility [expose in queue statuses] #991

Merged: 4 commits, Aug 8, 2023

Conversation

@mimowo (Contributor) commented Jul 17, 2023:

What type of PR is this?

/kind feature
/kind api-change

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #168

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot added the do-not-merge/work-in-progress, release-note, kind/feature, kind/api-change, and cncf-cla: yes labels (Jul 17, 2023)

@k8s-ci-robot added the size/L label (Jul 17, 2023)
@alculquicondor (Contributor):

cc @trasc

@mimowo mimowo force-pushed the expose-queue-head branch 2 times, most recently from 53d26d1 to ab1387a on July 18, 2023 12:57
keps/168-pending-workloads-visibility/README.md: 3 outdated review threads (resolved)
type PendingWorkloadsVisibility struct {
// MaxPendingWorkloadsInStatus indicates the maximal number of pending
// workloads for which their local queue order is exposed.
// Defaults to 100.
Contributor:

Maybe default to 0 therefore disabled.

@mimowo (Contributor Author), Jul 18, 2023:

Why we would set it to 0? I think the feature is quite valuable so we should make it enabled by default. Still, we may consider going through alpha API to give it some time for testing.

IMO, if the performance is a concern, then we should make sure that the penalty is negligible, at least for sufficiently small values of N, but greater than 0 :).

// MaxPendingWorkloadsInStatus indicates the maximal number of pending
// workloads for which their local queue order is exposed.
// Defaults to 100.
MaxTopPendingWorkloads *int32
@tenzen-y (Member), Jul 18, 2023:

Should we limit this to avoid LocalQueue's object size exceeding 1.5MiB in etcd?
Maybe, we should say object's size in the Risks and Mitigations section.

Contributor Author:

The limit in etcd is 1.5Mi, added a point

@tenzen-y (Member), Jul 18, 2023:

Right. My comment had a typo...
I've fixed it.

Contributor Author:

Right, I think it makes sense not to put a hard constraint, because the constraint depends on the lengths of the names and namespaces of the workloads. If we follow the 253-character limit of Kubernetes, we end up with a maximum of around 2000 (roughly 1.5MiB divided by a worst-case entry of several hundred bytes), which would be very constraining. In particular, many organizations have internal policies limiting the length of namespace names, for example because they derive them from the names of internal teams, so their actual entries would be much smaller than the worst case.

I suggest just expanding the user-facing documentation to note that going beyond 2000 might be problematic and should be thought through. Added a note in the Risks and Mitigations section.

keps/168-pending-workloads-visibility/README.md: outdated review thread (resolved)

### Goals

- expose the order of workloads in the LocalQueue and ClusterQueue
Contributor:

Suggested change:
-  - expose the order of workloads in the LocalQueue and ClusterQueue
+  - expose the order of workloads in the LocalQueue

Contributor Author:

Exposing the ClusterQueue will give the admin full visibility. It was suggested that the ClusterQueue should be the higher-priority work (#991 (comment)). Let me leave both for now.

@mimowo (Contributor Author), Jul 26, 2023:

First, it is important to note that the state of the LocalQueues might be inconsistent at a given moment. For example, two different workloads may both report position 1, due to the throttling of the updates.

Other approaches may exist to expose the information to admins; however, once the problems of throttling and object-size limiting are solved, the solution should be easy to transfer. That seems much easier than building a separate approach dedicated to admins. Adding a note in Alternatives.


The proposal is to extend the APIs for the status of LocalQueue and ClusterQueue
to expose the order of workloads. The order will only be exposed up to some
configurable depth.
Contributor:

Should we start with a fixed quantity? The limitation is on the size of the object. Why would users want less?
Instead, we could make the rate at which the summary is updated configurable.

Contributor Author:

> Should we start with a fixed quantity? The limitation is on the size of the object.

Good point, I think I like it, will update accordingly.

> Why would users want less?

Exposing positional information may trigger many updates; even if batched, there would be a request every 5s. With a low number, a user can reduce the rate of actual updates even more.

Contributor Author:

Another reason why a very large threshold might be problematic, even with red-black trees (unless there are versions allowing thread-safe iteration): if we iterate for long, we need to hold a read lock for the duration of the iteration. This can be problematic because it could mean lower throughput for update operations. One locked iteration per 5s should be OK, but allowing the depth to be limited might be desirable too.

Contributor Author:

Also, on second thought, cutting by object size might be tricky. In particular, how would we estimate the size of the actual object without serializing it? We could manually add up the sizes of the individual fields without serializing, but that might be hard to maintain. Also, if checking the object size is computationally expensive, it will make the iteration longer if checked after every added item.
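For illustration, a minimal sketch of the manual field-size approach being discussed; the per-entry overhead constant is a rough assumption, not a measured value:

```go
package sketch

// estimateEntrySize gives a rough, maintenance-heavy approximation of the
// serialized size of one pending-workload entry without running the encoder:
// the raw field lengths plus an assumed per-entry JSON/key overhead.
func estimateEntrySize(namespace, name string, positionDigits int) int {
	const perEntryOverhead = 64 // assumption: keys, quotes, separators
	return len(namespace) + len(name) + positionDigits + perEntryOverhead
}
```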

keps/168-pending-workloads-visibility/README.md: 3 outdated review threads (resolved)

// Position indicates the position of the workload among all pending
// workloads in the cluster queue.
Position *int32
Contributor:

we should call this approximatePosition, for two reasons:

  • by the time the position is posted, it's probably out of date.
  • it gives us the chance to use coarse numbers: 1, 2, 3, ..., 10, 20, 30, ..., 100, 200, 300, ..., 1000. This might potentially reduce the need for updates.

Contributor:

Why do we actually need this? Without access to the names of the other workloads, the position is not particularly useful. The user will not know whether there is 1 huge job or 100 tiny ones before their task in the queue. At the same time, Kueue will be using a lot of API server write quota to maintain this information.

Contributor Author:

> we should call this approximatePosition, for two reasons:
>
> • by the time the position is posted, it's probably out of date.

If we batch updates at 5s intervals, the delay does not seem significant enough to warrant calling it approximate. Similarly, the counter in the Job controller is called failed despite some potential delay.

> • it gives us the chance to use coarse numbers: 1, 2, 3, ..., 10, 20, 30, ..., 100, 200, 300, ..., 1000. This might potentially reduce the need for updates.

I think we want coarse positions; we should probably mention a range rather than a specific position.
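As an illustration of the coarse-number idea, here is one possible reading (rounding down to the sequence 1..10, 20, 30, ..., 100, 200, ...); this is a sketch, not part of the proposal:

```go
package sketch

// coarsePosition rounds a position down to the nearest value in the sequence
// 1, 2, ..., 10, 20, ..., 100, 200, ..., so that small changes in the exact
// position do not require a status update.
func coarsePosition(pos int32) int32 {
	step := int32(1)
	for pos >= step*10 {
		step *= 10
	}
	return (pos / step) * step
}
```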

Contributor:

Still, what is the point of this number? With a large number of LocalQueues pointing to the same ClusterQueue, the bigger part of the write quota will be used for these updates, even with 5 sec batching.

Contributor Author:

> Still, what is the point of this number?

The point is to give the user some feedback about their position in the ClusterQueue (which the user cannot see). This is analogous to hotline bots saying "your position is 5". As a caller you also don't know how long it will take for the other callers to finish, but you can get some estimate assuming, say, 10 min per caller. Similarly, in the context of Kueue you can get a rough estimate of how long it takes to process 5 items based on the historical velocity of the queue.

I think that ultimately, along with the position, we would like to have an ETA. Getting an ETA based on per-job execution time might be involved (it requires collecting historical data per job). Still, we could have a rough ETA based on the overall ClusterQueue velocity.
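A minimal sketch of that rough ETA idea, assuming some bookkeeping of the ClusterQueue's recent admission rate (the rate tracking itself is an assumption, not part of this KEP):

```go
package sketch

import "time"

// roughETA estimates the waiting time for a workload from its position in the
// queue and the recent admission rate of the ClusterQueue.
func roughETA(position int32, admissionsPerMinute float64) time.Duration {
	if admissionsPerMinute <= 0 {
		return 0 // no recent admissions, no meaningful estimate
	}
	minutes := float64(position) / admissionsPerMinute
	return time.Duration(minutes * float64(time.Minute))
}
```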

Contributor:

Honestly, I'm still not convinced (the benefits vs. write cost are not clear), but maybe I'm wrong. What about adding it as a possible phase 2 feature?

Contributor Author:

This is a good point; returning positions complicates the implementation, since the updates need to be throttled. I think we should indeed start the implementation without positions for LocalQueues + ClusterQueues.

Certainly we should split the implementation into small commits / PRs when it makes sense. As for the KEP, I'm not sure about the phase process; I could also split the KEP into two (a ClusterQueue KEP and a LocalQueue KEP).

However, I think returning the LocalQueue order without any positioning is pointless. Returning the position gives at least some clue, if you can estimate the queue velocity.

Contributor Author:

> Honestly, I'm still not convinced (the benefits vs. write cost are not clear), but maybe I'm wrong. What about adding it as a possible phase 2 feature?

I have split the implementation into phases now. PTAL.

@mimowo mimowo force-pushed the expose-queue-head branch 2 times, most recently from 853ac7d to 07a3afb on July 19, 2023 15:52

### Goals

- expose the order of workloads in the LocalQueue and ClusterQueue
Contributor:

Actually, IMHO, we should rather start from ClusterQueue and add LocalQueue a bit later.

Contributor Author:

I guess so, though some visibility into LocalQueues seems needed sooner or later, so it makes sense to discuss both in the same KEP. Also, we can split the implementation into phases.

Contributor:

What is the rationale for starting with ClusterQueue, given that it should be hidden from end users? Also, there might be privacy concerns, because the list would contain information from multiple namespaces.

Contributor Author:

I think the rationale for starting with ClusterQueue is that it is simpler and has a lower potential perf impact.

I've updated the KEP to propose 2 graduation stages for the feature: ClusterQueue for Beta, and LocalQueue in the next stage for Stable. Or we may have 2 stages for Beta. Let me know wdyt.

TopList []LocalQueuePendingWorkload

// LastChangeTime indicates the time of the last change of the structure.
LastChangeTime metav1.Time
Contributor:

When is this field updated? Every 5 sec, or every time there is a change in TopList (to indicate that the list is still current)? If the latter, what counts as a change?

@mimowo (Contributor Author), Jul 21, 2023:

Every time there is a change, but changes would be throttled so that there is at most one update per 5s. A change means an actual change in the object (a workload added or deleted, or a changed position).
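A minimal sketch of that throttling behaviour (assumed shapes, not the actual Kueue code): publish at most one status update per interval, and only when the exposed list actually changed:

```go
package sketch

import (
	"reflect"
	"time"
)

type statusThrottler struct {
	interval   time.Duration // e.g. 5 * time.Second
	lastUpdate time.Time
	lastList   []string // previously published workload keys, in order
}

// shouldPublish reports whether the newly computed list warrants a status
// update now; if it is throttled, the change is picked up on a later cycle.
func (t *statusThrottler) shouldPublish(now time.Time, newList []string) bool {
	if reflect.DeepEqual(newList, t.lastList) {
		return false // no actual change in the object
	}
	if now.Sub(t.lastUpdate) < t.interval {
		return false // throttled: at most one update per interval
	}
	t.lastUpdate = now
	t.lastList = newList
	return true
}
```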

Contributor:

Suggested change:
- LastChangeTime metav1.Time
+ LastChangeTimestamp metav1.Time

@mimowo (Contributor Author), Aug 1, 2023:

I have a slight preference towards just time, because:

  1. it is shorter
  2. it can be updated.

It seems that in k8s, time is used in the condition fields (lastTransitionTime and lastProbeTime), which can be updated, whereas timestamp is used for deletionTimestamp and creationTimestamp, which cannot be updated later. Not sure this is a deliberate k8s design, but it is something that occurred to me.

@k8s-ci-robot added the size/XL label and removed the size/L label (Jul 21, 2023)
should assume not more than 500Ki is used.

In order to mitigate this risk we put a constraint on the number of exposed
pending workloads. We limit the number to 1000.


For objects that we expect to number in the thousands or more in the cluster - as LocalQueues may, if there's one per user in a large organization, or one per experiment - I don't think it's a good idea to aim for nearly maxing out the object size. In such a setup there will likely be a need to list and watch queues, and watching large objects is much more expensive.

I think it would be good to keep the default size of objects much lower than that, even with high churn. If we can parametrize the number of pending workloads and set the default to some low number, that would allow the feature to both be enabled by default, and scalable by default. Users can then set a higher value if they prefer.

@mimowo (Contributor Author), Jul 26, 2023:

Yeah, I'm unsure how to come up with estimates for the default. The etcd limit is something tangible we can use for estimates relatively easily. The situation with many LocalQueues, many watches on them, and high churn has many variables connected in non-obvious ways, which does not give me a good basis for estimates.

Contributor Author:

Also, I think of two scenarios:

  • local queues setup manually by an admin, where there is a handful of them and they are per team
  • local queues setup automatically, per user or per experiment

In the second case we can indeed expect many of them, but they will probably not all have high churn all the time. I suppose most of them will be idle if they are created per researcher. If they are created per experiment, their depth will not be very large anyway.

Contributor Author:

I think there is a natural analogy with Jobs, where for the Indexed Jobs and Backoff Limit Per Index features the size of the status can be large, so we have limits on the fields that control it. The approach we took was to set up the limits so that the maximum object size is roughly 500Ki.

I think the considerations for Jobs and LocalQueues are very similar:

  • The object size may grow with new functionality; for example, we may want to add an ETA field per workload.
  • The worst-case scenario is unlikely in practice, since it assumes namespaces and names at the maximum length of 253 characters.
  • Many instances of LocalQueue (or Job) are possible, and both can be watched.

For reference calculations: https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3850-backoff-limits-per-index-for-indexed-jobs#the-job-object-too-big

Contributor Author:

Following this strategy, I'm now leaning towards setting the default to 1000, which translates to roughly 500Ki.

Contributor Author:

FWIW, the number of LocalQueues is constrained by the number of namespaces, which is up to 10k: https://github.com/kubernetes/community/blob/master/sig-scalability/configs-and-limits/thresholds.md#kubernetes-thresholds

Contributor Author:

Another aspect that makes LocalQueues different from Jobs is that here we need to hold the read lock for the duration of the iteration over the global ClusterQueue. Thus, setting a lower limit for LocalQueues actually makes sense.

Contributor Author:

> Another aspect that makes LocalQueues different from Jobs is that here we need to hold the read lock for the duration of the iteration over the global ClusterQueue. Thus, setting a lower limit for LocalQueues actually makes sense.

I have updated the proposal to take a periodic in-memory snapshot of the ClusterQueue, to avoid holding a prolonged read lock on the ClusterQueue while generating the structure for each LocalQueue.

Contributor Author:

I've added a section in Design Details on setting the defaults, and updated the sections on risks. @aleksandra-malinowska PTAL. In particular, I'm wondering if there is a way to translate the risk related to watches into some estimates. For now, I propose to set the limit to 1000 (following the Job Backoff Limit Per Index), but the default to 100, to reduce the chance of issues since the feature is enabled by default.

@mimowo mimowo force-pushed the expose-queue-head branch 3 times, most recently from a2eb59d to f217873 on July 26, 2023 10:25
// MaxCount indicates the maximal number of pending workloads exposed in the
// local queue status. The maximal value is 4000.
// Defaults to 10.
MaxCount int32
Member:

What happens if MaxCount is set to 0? Is visibility disabled?
I think it would be useful to temporarily stop updating the positions when cluster performance is badly degraded.

Contributor Author:

Right, when the value is 0 we don't send updates for the corresponding queue type. Further, when both values are 0 the feature is disabled and the snapshot isn't taken.

Contributor Author:

> I think it would be useful to temporarily stop updating the positions when cluster performance is badly degraded.

Yes, but then Kueue would need to be able to diagnose that cluster performance is bad. Typically, this is a job for API Priority and Fairness (P&F).

Kueue on its side sets a QPS limit of 20/s, but it wouldn't discriminate between position updates and other updates. An indirect way of controlling the QPS for status updates is the UpdateInterval parameter.

I think that to deal with large scale we may need another solution, such as an on-demand HTTP endpoint via an extension API server.
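A hedged sketch of how the knobs mentioned in this thread (a per-queue-type MaxCount and an UpdateInterval) could be grouped in the Kueue configuration; the field names and defaults here are illustrative assumptions rather than the merged schema:

```go
package sketch

// QueueVisibility groups the visibility knobs discussed above.
type QueueVisibility struct {
	// ClusterQueues configures visibility of pending workloads in
	// ClusterQueue statuses.
	ClusterQueues *ClusterQueueVisibility

	// UpdateIntervalSeconds is the interval between status snapshots; it
	// indirectly bounds the QPS spent on visibility updates.
	// Defaults to 5.
	UpdateIntervalSeconds int32
}

// ClusterQueueVisibility configures one queue type.
type ClusterQueueVisibility struct {
	// MaxCount is the maximal number of pending workloads exposed in the
	// status. Setting it to 0 disables the updates for this queue type.
	// Defaults to 10.
	MaxCount int32
}
```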

Contributor:

In general, updates to ClusterQueue and LocalQueue are lower priority than updates to Workloads or Jobs.

We could consider having a separate client for them.

Member:

> Right, when the value is 0 we don't send updates for the corresponding queue type. Further, when both values are 0 the feature is disabled and the snapshot isn't taken.

Thanks for the clarification :) SGTM

> Yes, but then Kueue would need to be able to diagnose that cluster performance is bad. Typically, this is a job for API Priority and Fairness (P&F).
>
> Kueue on its side sets a QPS limit of 20/s, but it wouldn't discriminate between position updates and other updates. An indirect way of controlling the QPS for status updates is the UpdateInterval parameter.
>
> I think that to deal with large scale we may need another solution, such as an on-demand HTTP endpoint via an extension API server.

I agree with you that we should set appropriate parameters for APF.

Member:

> In general, updates to ClusterQueue and LocalQueue are lower priority than updates to Workloads or Jobs.
>
> We could consider having a separate client for them.

Oh, it's a great idea!

along with their positions. Then, the LocalQueue and ClusterQueue controllers
do lookup into the cached structure.

The snapshot is taken periodically in intervals configured in
Member:

Can we sync snapshots in stages, not all at once?
If many ClusterQueues and LocalQueues already exist, I guess the kube-apiserver load will spike, and a prolonged read lock will occur immediately after starting or restarting the kueue-controller-manager.

Contributor Author:

Good point, which was also raised earlier. I now propose a queue of snapshot-taking tasks, processed by 5 workers. This should allow taking the snapshots in batches. Anyway, I think there is not much risk, as the read lock is only held for the duration of copying the underlying heap data.

Contributor:

Right, you take the data under the lock, but sorting happens outside of it. That could be relatively fast if you use copy.
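A minimal sketch of that pattern (assumed data shapes, not the actual Kueue code): copy the raw heap data while holding the read lock, then sort the copy outside the lock so workload updates are not blocked for long:

```go
package sketch

import (
	"sort"
	"sync"
)

type clusterQueueSnapshotter struct {
	mu      sync.RWMutex
	pending map[string]int64 // workload key -> ordering value (priority, creation time, ...)
}

// snapshot returns the pending workload keys in queue order for status updates.
func (c *clusterQueueSnapshotter) snapshot() []string {
	// Hold the read lock only while copying the raw data.
	c.mu.RLock()
	keys := make([]string, 0, len(c.pending))
	order := make(map[string]int64, len(c.pending))
	for k, v := range c.pending {
		keys = append(keys, k)
		order[k] = v
	}
	c.mu.RUnlock()

	// Sorting happens outside the lock.
	sort.Slice(keys, func(i, j int) bool { return order[keys[i]] < order[keys[j]] })
	return keys
}
```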

Member:

SGTM

@mimowo mimowo force-pushed the expose-queue-head branch 2 times, most recently from 4abf00b to 0f004ba on August 4, 2023 12:16
@mimowo (Contributor Author) commented Aug 4, 2023:

@mwielgus @alculquicondor I think I have addressed all the comments. Let me know if there is something more to be done or clarified.

@tenzen-y (Member) commented Aug 4, 2023:

LGTM from my side.

@alculquicondor (Contributor):

/approve
I'll leave the lgtm to @mwielgus

@alculquicondor (Contributor):

/hold

@k8s-ci-robot added the do-not-merge/hold label (Aug 4, 2023)
@mimowo mimowo changed the title Expose the information about the order of pending workloads KEP: Expose the order of pending workloads in queue statuses Aug 7, 2023
@mimowo mimowo changed the title KEP: Expose the order of pending workloads in queue statuses KEP: Pending workloads visibility Aug 7, 2023
@mimowo mimowo changed the title KEP: Pending workloads visibility KEP: Pending workloads visibility [expose in queue statuses] Aug 7, 2023
@mwielgus (Contributor) left a comment:

/lgtm

@k8s-ci-robot added the lgtm label (Aug 8, 2023)
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, mimowo, mwielgus

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mwielgus (Contributor) commented Aug 8, 2023:

/remove-hold

@k8s-ci-robot removed the do-not-merge/hold label (Aug 8, 2023)
@k8s-ci-robot k8s-ci-robot merged commit 5c60b9c into kubernetes-sigs:main Aug 8, 2023
4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.5 milestone Aug 8, 2023
@mimowo mimowo deleted the expose-queue-head branch November 29, 2023 14:58
Labels
approved, cncf-cla: yes, kind/api-change, kind/feature, lgtm, release-note-none, size/XL
Development

Successfully merging this pull request may close these issues.

A summary of Workloads order in queues
8 participants