Dynamic Job Parallelism and Resource Scaling Based on Backlog Metrics #2964
Reading the ask, I'm not entirely sure Kueue is the right place for this. It sounds like you want metrics to influence elastic job scaling. AFAIK Kueue would help admit jobs as they scale dynamically, but the controller that watches metrics and patches elastic jobs would probably belong to a separate CRD, outside of Kueue, since it seems you want HPA at the job level.
I'll leave the final decision on scope to @tenzen-y or @alculquicondor.
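For illustration, here is a rough sketch of the kind of external controller described above, deliberately separate from Kueue: it polls a backlog metric and patches the parallelism of an elastic batch/v1 Job. Everything specific here is a placeholder, not an existing API: the metric source (`fetchBacklog`), the namespace and Job name, and the scaling formula are all invented for the example.

```go
// Hypothetical sketch of a metrics-driven scaler that patches Job parallelism.
// It is NOT part of Kueue; Kueue would still decide whether the scaled-up pods
// are admitted (once dynamically sized Jobs are supported).
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// fetchBacklog is a stand-in for whatever queue or metrics system holds the backlog depth.
func fetchBacklog() (int32, error) { return 120, nil }

// targetParallelism maps backlog depth to a desired parallelism, clamped to [minP, maxP].
func targetParallelism(backlog, perWorker, minP, maxP int32) int32 {
	p := backlog / perWorker
	if p < minP {
		return minP
	}
	if p > maxP {
		return maxP
	}
	return p
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	for {
		backlog, err := fetchBacklog()
		if err == nil {
			p := targetParallelism(backlog, 10, 1, 50)
			patch := []byte(fmt.Sprintf(`{"spec":{"parallelism":%d}}`, p))
			// Patch the elastic Job's parallelism based on the current backlog.
			if _, err := client.BatchV1().Jobs("workers").Patch(
				context.TODO(), "backlog-consumer", types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
				fmt.Println("patch failed:", err)
			}
		}
		time.Sleep(30 * time.Second)
	}
}
```

A real implementation would more likely be a reconciling controller driven by a CRD that declares the metric query and the formula, rather than a fixed polling loop, but the division of labour would be the same: the scaler changes the desired size, and Kueue governs admission.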
Why not just use KEDA or HPA? I don't think Kueue is the right component to decide that. Still, there are two things that we need to do in Kueue to improve the experience:
And in Kubernetes:
FYI, the request to support dynamically scaled Jobs in Kueue is #77, and it already has a KEP: https://github.com/kubernetes-sigs/kueue/tree/main/keps/77-dynamically-sized-jobs
Yes, that's right. After we implement the feature, we may be able to use DynamicJob + KEDA.
What would you like to be added:
We would like to propose a new feature in Kueue that enables dynamic scaling of job parallelism and resource allocation (CPU, RAM, and pods) based on job backlog metrics and predefined formulas.
Idea: This feature would introduce a custom resource definition (CRD) that allows users to define scaling formulas and thresholds, which dynamically adjust the maximum parallelism and resource limits, similar to KEDA or HPA. A generic approach could be to expose the `/scale` subresource so that scaling works through a generic interface (see the sketch below).
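To make the generic-interface idea concrete, here is a minimal sketch (not an existing Kueue API) of how a scaler could drive any resource that serves the `/scale` subresource via the dynamic client. The `elasticjobs` GVR, namespace, and object name are invented for the example.

```go
// Illustrative only: adjusting replicas through the generic /scale subresource.
// Any resource that serves /scale (a Deployment, or a hypothetical elastic-job CRD)
// can be driven this way without type-specific code.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

// scaleTo patches spec.replicas on the /scale subresource of the given object.
func scaleTo(ctx context.Context, dc dynamic.Interface, gvr schema.GroupVersionResource, ns, name string, replicas int32) error {
	patch := []byte(fmt.Sprintf(`{"spec":{"replicas":%d}}`, replicas))
	// The trailing "scale" argument targets the /scale subresource instead of the main resource.
	_, err := dc.Resource(gvr).Namespace(ns).Patch(ctx, name, types.MergePatchType, patch, metav1.PatchOptions{}, "scale")
	return err
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	dc, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Placeholder GVR: imagine an elastic-job CRD that exposes /scale.
	gvr := schema.GroupVersionResource{Group: "example.io", Version: "v1alpha1", Resource: "elasticjobs"}
	if err := scaleTo(context.TODO(), dc, gvr, "workers", "backlog-consumer", 25); err != nil {
		fmt.Println("scale failed:", err)
	}
}
```

Because `/scale` has a uniform shape (`spec.replicas` / `status.replicas`), a formula-based scaler built this way would not need per-workload-kind code.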
Why is this needed:
Currently, we are processing around 4.5 million jobs per day, and managing resource usage and costs is critical. There is a need for a mechanism that can dynamically limit or expand the maximum parallelism of jobs based on real-time backlog conditions. This would help ensure that jobs are processed efficiently without overcommitting resources or incurring unnecessary costs.
By introducing a formula-based approach to flavor resources, we can achieve a more granular and responsive system. For example, the system could increase the max CPU or RAM allocation as the admission backlog grows, ensuring that delays are minimized during high-load periods while conserving resources during low-demand times. This functionality is crucial for maintaining both performance and cost-effectiveness in large-scale Kubernetes environments.
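As a purely illustrative example of the kind of formula meant here (the numbers, step size, and names are made up, not a proposed API), the snippet below grows a flavor's CPU and memory quota linearly with backlog depth, clamped between a baseline and a cost ceiling:

```go
// Illustrative formula only: scale resource quotas linearly with backlog depth,
// clamped between a floor (baseline throughput) and a ceiling (cost cap).
package main

import "fmt"

type quota struct {
	cpuMilli int64 // CPU in millicores
	memMiB   int64 // memory in MiB
}

// quotaFor grows the quota by a fixed increment for every perStep backlogged jobs.
func quotaFor(backlog int64, base, ceil quota, perStep int64) quota {
	steps := backlog / perStep
	q := quota{
		cpuMilli: base.cpuMilli + steps*500, // +0.5 CPU per step
		memMiB:   base.memMiB + steps*1024,  // +1 GiB per step
	}
	if q.cpuMilli > ceil.cpuMilli {
		q.cpuMilli = ceil.cpuMilli
	}
	if q.memMiB > ceil.memMiB {
		q.memMiB = ceil.memMiB
	}
	return q
}

func main() {
	base := quota{cpuMilli: 4000, memMiB: 8192}
	ceil := quota{cpuMilli: 64000, memMiB: 131072}
	for _, backlog := range []int64{0, 5000, 50000, 500000} {
		q := quotaFor(backlog, base, ceil, 1000)
		fmt.Printf("backlog=%d -> cpu=%dm mem=%dMi\n", backlog, q.cpuMilli, q.memMiB)
	}
}
```

The same shape of formula could drive maximum parallelism instead of, or in addition to, the flavor quota.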
This enhancement requires the following artifacts:
The artifacts should be linked in subsequent comments.