Dynamic Job Parallelism and Resource Scaling Based on Backlog Metrics #2964

Open
3 tasks
woehrl01 opened this issue Sep 3, 2024 · 5 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@woehrl01
Contributor

woehrl01 commented Sep 3, 2024

What would you like to be added:

We would like to propose a new feature in Kueue that enables dynamic scaling of job parallelism and resource allocation (CPU, RAM, and pods) based on job backlog metrics and predefined formulas.

Idea: This feature would introduce a custom resource definition (CRD) that lets users define scaling formulas and thresholds. These would dynamically adjust the maximum parallelism and resource limits, similar to what KEDA or HPA do for Deployments. A generic approach could be to expose the /scale subresource so that external scalers have a common interface to target.
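
To make the "similar to KEDA or HPA" idea concrete, here is a minimal sketch of the proportional rule that HPA documents (desired = ceil(current * currentMetric / targetMetric)), applied to job parallelism. The function name, parameter names, and the clamping bounds are assumptions made for illustration; nothing like this exists in Kueue today.

```python
import math


def desired_parallelism(
    current: int,
    backlog_per_pod: float,
    target_backlog_per_pod: float,
    min_parallelism: int,
    max_parallelism: int,
) -> int:
    """HPA-style proportional rule, clamped to [min_parallelism, max_parallelism].

    All names here are hypothetical; only the formula itself follows the
    documented HPA algorithm.
    """
    if target_backlog_per_pod <= 0:
        raise ValueError("target_backlog_per_pod must be positive")
    # Scale the current parallelism by how far the observed backlog per pod
    # is from the target, rounding up so we never under-provision.
    raw = math.ceil(current * (backlog_per_pod / target_backlog_per_pod))
    return max(min_parallelism, min(max_parallelism, raw))
```

For example, with 10 pods, 30 pending items per pod, and a target of 20 per pod, the rule asks for 15 pods; the bounds keep the result from dropping to zero or growing without limit.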

Why is this needed:

Currently, we are processing around 4.5 million jobs per day, and managing resource usage and costs is critical. There is a need for a mechanism that can dynamically limit or expand the maximum parallelism of jobs based on real-time backlog conditions. This would help ensure that jobs are processed efficiently without overcommitting resources or incurring unnecessary costs.

By introducing a formula-based approach to flavor resources, we can achieve a more granular and responsive system. For example, the system could increase the max CPU or RAM allocation as the admission backlog grows, ensuring that delays are minimized during high-load periods while conserving resources during low-demand times. This functionality is crucial for maintaining both performance and cost-effectiveness in large-scale Kubernetes environments.
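
As a rough illustration of what a formula on flavor resources could look like, the sketch below grows a CPU quota linearly with the number of pending workloads and caps it at a maximum. The class, its fields, and the linear shape of the formula are all assumptions; the proposal does not fix any of these.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuotaFormula:
    """Hypothetical backlog-driven quota formula (not a Kueue API)."""

    base_cpu: int            # cores granted when the backlog is empty
    cpu_per_pending: float   # extra cores per pending workload
    max_cpu: int             # hard ceiling, to keep costs bounded

    def cpu_quota(self, pending_workloads: int) -> int:
        # Treat a negative reading as an empty backlog.
        pending = max(0, pending_workloads)
        extra = int(pending * self.cpu_per_pending)
        # Never exceed the ceiling, however large the backlog gets.
        return min(self.max_cpu, self.base_cpu + extra)
```

With `QuotaFormula(base_cpu=100, cpu_per_pending=0.5, max_cpu=1000)`, an empty backlog yields 100 cores, 400 pending workloads yield 300 cores, and anything beyond 1800 pending workloads is capped at 1000 cores.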

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

woehrl01 added the kind/feature label Sep 3, 2024
@kannon92
Contributor

kannon92 commented Sep 3, 2024

Reading the ask, I’m not entirely sure Kueue is the right place for this.

It sounds like you want metrics to influence elastic job scaling. AFAIK, Kueue would help admit jobs based on dynamic scaling, but I think the controller that looks at metrics and patches elastic jobs would be a separate CRD. I would expect that CRD to live outside Kueue, since it seems you want HPA at the job level.

@kannon92
Contributor

kannon92 commented Sep 3, 2024

I'll leave the final decision on scope to @tenzen-y or @alculquicondor.

@alculquicondor
Contributor

> similar to KEDA or HPA

Why not just use KEDA or HPA?

I don't think Kueue is the right component to decide that.

Still, there are two things that we need to do in Kueue to improve the experience:

  • Support job resizing, which was started but not completed.
  • Add any missing metrics about the length of the queues that can be fed into the external scaler.

And in Kubernetes:

  • Support the /scale subresource for jobs (if this is the API you are targeting).
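
To illustrate the "queue length metrics fed into an external scaler" point, here is a minimal sketch that sums a pending-workloads gauge per queue from Prometheus text exposition output. The metric name `kueue_pending_workloads` and the `cluster_queue` label are assumptions here and should be checked against the Kueue metrics documentation; the helper itself is hypothetical.

```python
import re

# Matches lines such as: metric_name{label="value",...} 12
_SAMPLE_RE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)')
_LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def pending_by_queue(
    metrics_text: str,
    metric: str = "kueue_pending_workloads",  # assumed metric name
    queue_label: str = "cluster_queue",       # assumed label name
) -> dict[str, float]:
    """Sum the samples of one gauge per queue label value."""
    totals: dict[str, float] = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blank lines and HELP/TYPE comments.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_RE.match(line)
        if match is None or match.group(1) != metric:
            continue
        labels = dict(_LABEL_RE.findall(match.group(2)))
        queue = labels.get(queue_label)
        if queue is None:
            continue
        totals[queue] = totals.get(queue, 0.0) + float(match.group(3))
    return totals
```

An external scaler (KEDA, HPA with an external metric, or a custom controller) could use a per-queue total like this as its input signal.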

@mimowo
Contributor

mimowo commented Sep 6, 2024

FYI the request to support dynamically scaled Jobs in Kueue: #77, it already has a KEP: https://github.com/kubernetes-sigs/kueue/tree/main/keps/77-dynamically-sized-jobs

@tenzen-y
Member

tenzen-y commented Sep 6, 2024

> FYI the request to support dynamically scaled Jobs in Kueue: #77, it already has a KEP: https://github.com/kubernetes-sigs/kueue/tree/main/keps/77-dynamically-sized-jobs

Yes, that's right. After we implement the feature, we may be able to use DynamicJob + KEDA.

6 participants
@alculquicondor @woehrl01 @kannon92 @mimowo @tenzen-y and others