KEP: Pending workloads visibility [expose in queue statuses] #991
Conversation
✅ Deploy Preview for kubernetes-sigs-kueue canceled.
cc @trasc
Force-pushed from 53d26d1 to ab1387a
type PendingWorkloadsVisibility struct {
	// MaxPendingWorkloadsInStatus indicates the maximal number of pending
	// workloads for which their local queue order is exposed.
	// Defaults to 100.
Maybe default to 0, so the feature is disabled by default.
Why would we set it to 0? I think the feature is quite valuable, so we should make it enabled by default. Still, we may consider going through an alpha API to give it some time for testing.
IMO, if the performance is a concern, then we should make sure that the penalty is negligible, at least for sufficiently small values of N, but greater than 0 :).
	// MaxPendingWorkloadsInStatus indicates the maximal number of pending
	// workloads for which their local queue order is exposed.
	// Defaults to 100.
	MaxTopPendingWorkloads *int32
Should we limit this to avoid the LocalQueue's object size exceeding 1.5MiB in etcd?
Maybe we should mention the object's size in the Risks and Mitigations section.
The limit in etcd is 1.5Mi; added a point.
Right. My comment had a typo...
I fixed my comment.
Right, I think it makes sense not to put a hard constraint. This is because the constraint depends on the lengths of the names and namespaces of workloads. If we follow the 253-character limit of Kubernetes, we may end up with a max of 2000, which would be very constraining. Especially since many organizations may have internal policies to limit the number of characters for namespaces, for example, as they derive the namespace names from the names of internal teams, etc.
I suggest just expanding the user-facing documentation to say that going beyond 2000 might be problematic and should be thought through. Added a note in the Risks and Mitigations section.
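For a rough sense of scale, here is a back-of-envelope sketch (the per-entry JSON overhead is an assumed number, not something taken from the KEP) of how many worst-case entries fit under the etcd object size limit:

```go
package main

import "fmt"

const (
	maxNameLen       = 253         // Kubernetes max object name length
	maxNamespaceLen  = 253         // Kubernetes max namespace name length
	perEntryOverhead = 50          // assumed bytes for JSON keys and separators
	etcdObjectLimit  = 1536 * 1024 // ~1.5MiB etcd object size limit
)

func main() {
	perEntry := maxNameLen + maxNamespaceLen + perEntryOverhead
	// Prints roughly 2800; extra per-entry fields (positions, timestamps)
	// push the practical ceiling down towards the ~2000 mentioned above.
	fmt.Println(etcdObjectLimit / perEntry)
}
```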
Force-pushed from ab1387a to 673d463
### Goals

- expose the order of workloads in the LocalQueue and ClusterQueue
Suggested change: "expose the order of workloads in the LocalQueue and ClusterQueue" → "expose the order of workloads in the LocalQueue"
Exposing the clusterqueue will allow the admin to have full visibility. It was suggested that the clusterqueue should be the higher priority work (#991 (comment)). Let me leave both for now.
First, it is important to note that the state of the LocalQueues might be inconsistent at a given moment. For example, two different workloads may say they have position 1, due to the throttling of the updates.
Now, other approaches may exist to expose the information for admins; however, with the problems of throttling and object-size limiting solved, the solution should be easy to transfer. It seems much easier than building an alternative approach dedicated to admins. Adding a note in Alternatives.
The proposal is to extend the APIs for the status of LocalQueue and ClusterQueue
to expose the order of workloads. The order will only be exposed up to some
configurable depth.
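For readers following along, here is a minimal sketch of what such a status extension could look like, assembled only from the fields quoted in this review thread (TopList, LastChangeTime, Position); the exact type and field names in the final KEP may differ.

```go
package v1beta1 // hypothetical package, for illustration only

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// LocalQueuePendingWorkload describes one pending workload; field names
// follow the snippets quoted in this review and are not authoritative.
type LocalQueuePendingWorkload struct {
	// Name of the pending Workload.
	Name string `json:"name"`
	// Position of the workload among all pending workloads in the cluster queue.
	Position *int32 `json:"position,omitempty"`
}

// LocalQueuePendingWorkloadsStatus exposes the top pending workloads up to a
// configurable depth, plus the time of the last change.
type LocalQueuePendingWorkloadsStatus struct {
	// TopList contains the first N pending workloads, in queue order.
	TopList []LocalQueuePendingWorkload `json:"topList,omitempty"`
	// LastChangeTime indicates the time of the last change of the structure.
	LastChangeTime metav1.Time `json:"lastChangeTime,omitempty"`
}
```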
Should we start with a fixed quantity? The limitation is on the size of the object. Why would users want less?
Instead, we could make the rate at which the summary is updated configurable.
> Should we start with a fixed quantity? The limitation is on the size of the object.

Good point, I think I like it, will update accordingly.

> Why would users want less?

Exposing positional information may trigger many updates; even if batched, there would be a request every 5s. With a low number, a user can reduce the rate of actual updates even more.
Another reason why a very large threshold might be problematic, even with red-black trees (unless there are versions allowing thread-safe iteration): if we iterate for long, we need to take a read lock for the duration of the iteration. This can be problematic because it could mean lower throughput for update operations. One locked iteration per 5s should be ok, but allowing the depth to be limited might be desirable too.
Also, on second thought, cutting by the object size might be tricky. In particular, how would we estimate the size of the actual object without serializing it? We could manually add up the sizes of individual fields without serializing, but that might be hard to maintain. Also, if the procedure of checking the object size is computationally expensive, it will make the iteration longer if checked after every added item.
	// Position indicates the position of the workload among all pending
	// workloads in the cluster queue.
	Position *int32
We should call this approximatePosition, for two reasons:
- by the time the position is posted, it's probably out of date.
- it gives us the chance to use coarse numbers: 1, 2, 3, ..., 10, 20, 30, ..., 100, 200, 300, ..., 1000. This might potentially reduce the need for updates.
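As an illustration of the coarse-numbers idea (the function below is made up for this comment, not part of the KEP), a position can be rounded down to one significant digit so small movements in the queue don't force status updates:

```go
package sketch

// coarsePosition rounds a position down to one significant digit, producing
// the sequence 1, 2, ..., 10, 20, ..., 100, 200, ..., 1000 suggested above.
// For example, coarsePosition(57) == 50 and coarsePosition(234) == 200.
func coarsePosition(pos int32) int32 {
	if pos < 10 {
		return pos
	}
	magnitude := int32(1)
	for pos >= magnitude*10 {
		magnitude *= 10
	}
	return (pos / magnitude) * magnitude
}
```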
Why do we actually need this? Without access to the names of the other workloads, the position is not particularly useful. The user will not know whether there is 1 huge job or 100 tiny ones ahead of their task in the queue. At the same time, Kueue will be using a lot of API server write quota to maintain this information.
> We should call this approximatePosition, for two reasons:
> - by the time the position is posted, it's probably out of date.

If we batch updates and have a 5s interval, it does not seem relevant enough to name it approximate. Similarly, we call the counter failed in the Job controller despite some potential delay.

> - it gives us the chance to use coarse numbers: 1, 2, 3, ..., 10, 20, 30, ..., 100, 200, 300, ..., 1000. This might potentially reduce the need for updates.

I think for coarse positions we should probably mention a range rather than specific positions.
Still, what is the point of this number? With a large number of LocalQueues pointing to the same ClusterQueue, the bigger part of the write quota will be used for these updates, even with 5s batching.
> Still, what is the point of this number?

The point is to give the user some feedback about their position in the cluster queue (which the user does not see). This is analogous to hot-line bots saying "your position is 5". As a caller you also don't know how long it will take for the other callers to finish; however, you can get some estimate assuming 10min per caller. Similarly, in the context of Kueue, you can get a rough estimate of how long it takes to process 5 items based on the historical velocity of the processing.
I think, ultimately, along with the position we would like to have an ETA. Getting an ETA based on execution time per job might be involved (it requires collecting historical data per job). Still, we could have a rough ETA based on the overall cluster queue velocity.
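To make the "rough ETA from queue velocity" idea concrete, here is a sketch under the assumption that an admission velocity per ClusterQueue is tracked somewhere; the function and its inputs are hypothetical, not proposed API:

```go
package sketch

import "time"

// estimateETA derives a rough waiting time from a pending workload's position
// and an observed admission velocity (workloads admitted per minute). Both
// the inputs and the formula are assumptions for the sake of this discussion.
func estimateETA(position int32, admittedPerMinute float64) (time.Duration, bool) {
	if admittedPerMinute <= 0 || position < 0 {
		return 0, false // no measurable velocity yet, so no estimate
	}
	minutes := float64(position) / admittedPerMinute
	return time.Duration(minutes * float64(time.Minute)), true
}
```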
Honestly, I'm still not convinced (the benefits vs. write cost are not clear), but maybe I'm wrong. What about adding it as a possible phase 2 feature?
This is a good point; returning positions adds a complication to the implementation to throttle the updates. I think we should indeed start the implementation without positions for LocalQueues + ClusterQueues.
Certainly we should split the implementation into small commits / PRs when it makes sense. As for the KEP, yes, I'm not sure about the phase process. I could also split the KEP into two (a ClusterQueue KEP and a LocalQueue KEP).
However, I think returning the local queue order without even positioning is pointless. Returning the position gives at least some clue, if you can estimate the queue velocity.
> Honestly, I'm still not convinced (the benefits vs. write cost are not clear), but maybe I'm wrong. What about adding it as a possible phase 2 feature?

I have split the implementation into phases now. PTAL.
Force-pushed from 853ac7d to 07a3afb
### Goals

- expose the order of workloads in the LocalQueue and ClusterQueue
Actually, IMHO, we should rather start from ClusterQueue and add LocalQueue a bit later.
I guess so, though some visibility into local queues seems needed sooner or later, so it seems to make sense to discuss both in the same KEP. Also, we can split the implementation into phases.
What is the rationale to start with ClusterQueue, given that it should be hidden from end users? Also, there might be privacy concerns because the list would have information for multiple namespaces.
I think the rationale for starting with ClusterQueue is that it is simpler and has a lower potential perf impact.
I've updated the KEP to propose 2 graduation stages for the feature: ClusterQueue for Beta, and LocalQueue in the next phase for Stable. Or we may have 2 stages for Beta. Let me know wdyt.
	TopList []LocalQueuePendingWorkload

	// LastChangeTime indicates the time of the last change of the structure.
	LastChangeTime metav1.Time
When is this field updated? Every 5 sec, or every time there is a change in TopList (to indicate that the list is still current)? If the second, what is understood as a change?
Every time there is a change, but changes would be throttled so that there is at most one change per 5s. A change means an actual change in the object (it could be a new workload added / deleted, or a changed position).
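A minimal sketch of that throttling (the names and structure are made up, not the actual Kueue code): changes only mark the status dirty, and at most one flush is sent per interval.

```go
package sketch

import "time"

// statusThrottler batches status changes so that at most one update is sent
// per updateInterval (e.g. 5s).
type statusThrottler struct {
	updateInterval time.Duration
	lastUpdate     time.Time
	dirty          bool
}

// markDirty records that the in-memory structure changed.
func (t *statusThrottler) markDirty() { t.dirty = true }

// maybeFlush sends the pending change only if something changed and the
// interval since the last update has elapsed.
func (t *statusThrottler) maybeFlush(now time.Time, send func() error) error {
	if !t.dirty || now.Sub(t.lastUpdate) < t.updateInterval {
		return nil
	}
	if err := send(); err != nil {
		return err
	}
	t.lastUpdate = now
	t.dirty = false
	return nil
}
```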
Suggested change: LastChangeTime metav1.Time → LastChangeTimestamp metav1.Time
I have a slight preference towards just time, because:
- it is shorter
- it can be updated.
It seems that in k8s, time is used in the condition fields (lastTransitionTime and lastProbeTime), which can be updated, whereas timestamp is used for deletionTimestamp and creationTimestamp, which cannot be updated later. Not sure this is a deliberate k8s design, but it is something that occurred to me.
Force-pushed from c434e96 to 4b6db68
should assume not more than 500Ki is used.

In order to mitigate this risk we put a constraint on the number of exposed
pending workloads. We limit the number to 1000.
For objects that we expect to number in the thousands or more in the cluster (such as LocalQueue may be, if there's one per user in a large organization, or one per experiment), I don't think it's a good idea to aim for nearly maxing out the object size. In such a setup there will likely be a need to list and watch queues, and watching large objects is much more expensive.
I think it would be good to keep the default size of objects much lower than that, even with high churn. If we can parametrize the number of pending workloads and set the default to some low number, that would allow the feature to be both enabled by default and scalable by default. Users can then set a higher value if they prefer.
Yeah, I'm hesitant about how to come up with estimations for the default. The etcd limit is something tangible we can use for estimations relatively easily. The situation with many LocalQueues, many watches on them, and high churn has many variables which are interconnected in odd ways, so it does not give me a good basis for estimations.
Also, I think of two scenarios:
- local queues set up manually by an admin, where there is a handful of them and they are per team
- local queues set up automatically, per user or per experiment
In the second case we can indeed expect many of them, but they will probably not have a high churn all the time. I suppose most of them will be idle if they are created per researcher. If they are created per experiment, their depth will not be large anyway.
I think there is a natural analogy with Jobs, where for the Indexed Jobs and Backoff Limit Per Index features the size of the status can be large, so we have limits for the fields that control it. The approach we took was to set up the limits so that the max object size is roughly 500Ki.
I think the considerations between Jobs and LocalQueues are very similar:
- The object size may grow with new functionalities; for example, we may want to add an ETA field per workload.
- The worst-case scenario is unlikely in practice, since it assumes the lengths of namespaces and names to be at the max of 253 characters.
- Many instances of LocalQueue (or Job) are possible, and they both can be watched.
For reference calculations: https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3850-backoff-limits-per-index-for-indexed-jobs#the-job-object-too-big
Following this strategy, I’m now leaning to set the default at 1000, which translates to roughly 500k.
FWIW, the number of LocalQueues is constrained by the number of namespaces, which is up to 10k: https://github.com/kubernetes/community/blob/master/sig-scalability/configs-and-limits/thresholds.md#kubernetes-thresholds
Another aspect here that makes the LocalQueues different from Jobs is that here we need to take the read lock for the duration of the iteration over the global cluster queue. Thus, setting a lower limit for local queues actually makes sense.
> Another aspect here that makes the LocalQueues different from Jobs is that here we need to take the read lock for the duration of the iteration over the global cluster queue. Thus, setting a lower limit for local queues actually makes sense.

I have updated the proposal to take a periodic in-memory snapshot of the ClusterQueue, in order to avoid taking a prolonged read lock on the ClusterQueue while generating the structure for the LocalQueue.
I've added a section in Design Details on setting the defaults, and updated the sections on risks. @aleksandra-malinowska PTAL. In particular, I'm wondering if there is a way to translate the risk related to watches into some estimations. For now, I propose to set the limit to 1000 (following the Job Backoff Limit Per Index), but the default to 100, to reduce the chance of issues as the feature is enabled by default.
Force-pushed from a2eb59d to f217873
	// MaxCount indicates the maximal number of pending workloads exposed in the
	// local queue status. The maximal value is 4000.
	// Defaults to 10.
	MaxCount int32
What happens if MaxCount is set to 0? Is visibility disabled?
I think it would be useful to temporarily stop updating the positions when the cluster performance is much degraded.
Right, when the value is 0, then we don't send the updates for the corresponding queue type. Further, when both values are 0, the feature is disabled and the snapshot isn't taken.
> I think it would be useful to temporarily stop updating the positions when the cluster performance is much degraded.

Yes, but then Kueue would need to be able to diagnose that the cluster performance is bad. Typically, this is a job for P&F.
Kueue on its side sets a QPS limit of 20/s, but it wouldn't discriminate between updates for positions and other updates. An indirect way of controlling the QPS for status updates is the UpdateInterval parameter.
I think to deal with large scale we may need another solution, such as an on-demand HTTP endpoint via an Extension API Server.
In general updates to ClusterQueue and LocalQueue are lower priority than updates to Workload or the jobs.
We could consider having a separate client for them.
> Right, when the value is 0, then we don't send the updates for the corresponding queue type. Further, when both values are 0, the feature is disabled and the snapshot isn't taken.

Thanks for the clarification :) SGTM

> Yes, but then Kueue would need to be able to diagnose that the cluster performance is bad. Typically, this is a job for P&F.
> Kueue on its side sets a QPS limit of 20/s, but it wouldn't discriminate between updates for positions and other updates. An indirect way of controlling the QPS for status updates is the UpdateInterval parameter.
> I think to deal with large scale we may need another solution, such as an on-demand HTTP endpoint via an Extension API Server.

I agree with you that we should set appropriate parameters for the APF.
> In general updates to ClusterQueue and LocalQueue are lower priority than updates to Workload or the jobs.
> We could consider having a separate client for them.

Oh, it's a great idea!
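A sketch of the separate-client idea, assuming a controller-runtime based setup; the QPS/Burst values below are placeholders, not settings taken from Kueue.

```go
package sketch

import (
	"k8s.io/client-go/rest"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// newQueueStatusClient builds a client dedicated to LocalQueue/ClusterQueue
// status updates, with its own (lower) client-side rate limits so these
// updates cannot crowd out Workload and Job updates.
func newQueueStatusClient() (client.Client, error) {
	cfg := rest.CopyConfig(ctrl.GetConfigOrDie())
	cfg.QPS = 5    // assumed budget for queue-status updates only
	cfg.Burst = 10 // assumed
	return client.New(cfg, client.Options{})
}
```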
along with their positions. Then, the LocalQueue and ClusterQueue controllers
do lookup into the cached structure.

The snapshot is taken periodically in intervals configured in
Can we sync snapshots in stages, not all at once?
In the case of many clusterqueues and localqueues already existing, I guess the kube-apiserver load will explode, and a prolonged read lock will occur immediately after starting or restarting the kueue-controller-manager.
Good point, which was also raised earlier. I now propose to have a queue of snapshot-taking tasks. The queue will be processed by 5 workers. This should allow taking the snapshots in batches. Anyway, I think there is not much risk, as the read lock is only taken for the duration of copying the underlying heap data.
Right, you take the data, but sorting happens outside of the lock. That could be relatively fast if you use copy.
SGTM
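Roughly what the copy-under-lock approach discussed above could look like (the types and field names are stand-ins, not Kueue internals): only the copy happens under the read lock, and sorting happens on the copy.

```go
package sketch

import (
	"sort"
	"sync"
)

// pendingEntry stands in for whatever key the cluster queue heap stores.
type pendingEntry struct {
	Namespace, Name string
	Priority        int32
}

type clusterQueueCache struct {
	mu      sync.RWMutex
	pending []pendingEntry
}

// snapshot copies the heap data under the read lock, then sorts the copy
// outside the lock so update operations are not blocked during sorting.
func (c *clusterQueueCache) snapshot() []pendingEntry {
	c.mu.RLock()
	snap := make([]pendingEntry, len(c.pending))
	copy(snap, c.pending)
	c.mu.RUnlock()

	sort.Slice(snap, func(i, j int) bool {
		return snap[i].Priority > snap[j].Priority // higher priority first
	})
	return snap
}
```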
Force-pushed from 72d8df7 to 9a17870
Force-pushed from 9a17870 to cd55855
Force-pushed from 4abf00b to 0f004ba
Force-pushed from 0f004ba to 1b565f6
@mwielgus @alculquicondor I think I have addressed all the comments. Let me know if there is something more to be done or clarified.
LGTM from my side.
/approve
/hold
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: alculquicondor, mimowo, mwielgus.
/remove-hold
What type of PR is this?
/kind feature
/kind api-change
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #168
Special notes for your reviewer:
Does this PR introduce a user-facing change?