Support dynamically sized (elastic) jobs #77

Open
ahg-g opened this issue Feb 26, 2022 · 11 comments · Fixed by #1851
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature. kind/grand-feature lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete.

Comments

@ahg-g
Contributor

ahg-g commented Feb 26, 2022

We should have a clear path towards supporting Spark and other dynamically sized jobs. Ray is another example.

A related aspect is supporting dynamic updates to a workload's resource requirements. We can probably limit this to changing the count of a PodSet in QueuedWorkload: in Spark, the number of workers can change while the job runs, but the resource requirements of each worker cannot.

One idea is to model it similarly to "in-place update to pod resources" [1], except that in our case the mutable field would be the count. The Spark driver pod would watch the corresponding QueuedWorkload instance and adjust the number of workers once the new count is admitted.

[1] https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1287-in-place-update-pod-resources
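
The idea above can be sketched roughly as follows. This is a toy model in Python, not Kueue's API: the `PodSet` class, its fields, and the `admit` and `reconcile_workers` functions are all hypothetical names chosen for illustration. It also assumes that a scale-up may be admitted partially when quota is short, which the proposal does not specify.

```python
from dataclasses import dataclass


@dataclass
class PodSet:
    """Hypothetical stand-in for one PodSet of a QueuedWorkload."""
    name: str
    cpu_per_pod: int          # per-pod requirement; fixed in this model
    requested_count: int      # mutable: what the job currently asks for
    admitted_count: int = 0   # what the queue has admitted so far


def admit(pod_set: PodSet, free_cpu: int) -> int:
    """Admit as much of the requested count as the free quota allows.

    Returns the CPU left over after admission. Scale-downs always succeed
    and return quota; scale-ups are capped by free_cpu (an assumption).
    """
    delta = pod_set.requested_count - pod_set.admitted_count
    if delta <= 0:
        # Scale down (or no change): release the quota of the dropped pods.
        pod_set.admitted_count = pod_set.requested_count
        return free_cpu + (-delta) * pod_set.cpu_per_pod
    # Scale up: admit only as many extra pods as the free quota covers.
    extra = min(delta, free_cpu // pod_set.cpu_per_pod)
    pod_set.admitted_count += extra
    return free_cpu - extra * pod_set.cpu_per_pod


def reconcile_workers(running: int, pod_set: PodSet) -> int:
    """Driver side: number of workers to start (positive) or stop (negative)."""
    return pod_set.admitted_count - running
```

In this sketch the driver never acts on `requested_count` directly. It only follows `admitted_count`, which mirrors the point that workers are adjusted once the new count is admitted.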

@ahg-g ahg-g added kind/feature Categorizes issue or PR as related to a new feature. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. labels Feb 26, 2022
@alculquicondor alculquicondor changed the title Support Spark jobs Support Spark jobs and dynamic sized jobs Mar 17, 2022
@alculquicondor alculquicondor changed the title Support Spark jobs and dynamic sized jobs Support Spark jobs and dynamically sized jobs Mar 17, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 12, 2022
@kerthcet
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 12, 2022
@alculquicondor
Contributor

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Jul 12, 2022
@alculquicondor alculquicondor mentioned this issue Jul 18, 2023
3 tasks
@alculquicondor alculquicondor changed the title Support Spark jobs and dynamically sized jobs Support dynamically sized (elastic) jobs Aug 25, 2023
@andrewsykim
Member

I am interested in working on this -- this probably needs some sort of design doc, will work with @alculquicondor and see if I can put something together in the next few weeks

/assign

@tenzen-y
Member

tenzen-y commented Dec 7, 2023

> I am interested in working on this -- this probably needs some sort of design doc, will work with @alculquicondor and see if I can put something together in the next few weeks
>
> /assign

Hi @andrewsykim! Is there any progress?

@andrewsykim
Member

@tenzen-y I was planning to work on this in a couple weeks during the holiday season, but feel free to start working on this if you're interested.

@tenzen-y
Member

tenzen-y commented Dec 8, 2023

@andrewsykim Thanks. I also don't have enough time now. So, when I can get enough time, I will ask for progress again.

@andrewsykim
Member

FYI @vicentefb and I are working on a proposal in a google doc, we will share it here soon when it's ready

k8s-ci-robot pushed a commit that referenced this issue Apr 3, 2024
* added kep

* kep updated

applied toc

* updated kep

* toc updated

* added info in unit tests and integration tests section

* added details about workload slices

* rephrase scale down section

* updated and added details on slices, generalized design details and typos

* update

* added details about MultiKueue and removed users from approvers
@tenzen-y
Member

tenzen-y commented Apr 3, 2024

/reopen

@k8s-ci-robot
Contributor

@tenzen-y: Reopened this issue.


@k8s-ci-robot k8s-ci-robot reopened this Apr 3, 2024
vsoch pushed a commit to researchapps/kueue that referenced this issue Apr 18, 2024
kannon92 pushed a commit to openshift-kannon92/kubernetes-sigs-kueue that referenced this issue Nov 19, 2024
7 participants