From e54d960cf1d768adb0aadf3eae684646b2a7ec10 Mon Sep 17 00:00:00 2001
From: Prasad Saraf
Date: Mon, 15 Aug 2022 20:20:01 +0530
Subject: [PATCH 1/5] KEP-78: Dynamically reclaiming resources

---
 .../README.md | 122 ++++++++++++++++++
 .../kep.yaml  |  18 +++
 2 files changed, 140 insertions(+)
 create mode 100644 enhancements/78-dynamically-reclaiming-resources/README.md
 create mode 100644 enhancements/78-dynamically-reclaiming-resources/kep.yaml

diff --git a/enhancements/78-dynamically-reclaiming-resources/README.md b/enhancements/78-dynamically-reclaiming-resources/README.md
new file mode 100644
index 0000000000..62f19ef52c
--- /dev/null
+++ b/enhancements/78-dynamically-reclaiming-resources/README.md
@@ -0,0 +1,122 @@
+# Dynamically reclaim resources of Pods of a Workload
+
+## Table of Contents
+
+- [Summary](#summary)
+- [Motivation](#motivation)
+  - [Goals](#goals)
+  - [Non-Goals](#non-goals)
+- [Proposal](#proposal)
+  - [Risks and Mitigations](#risks-and-mitigations)
+- [API](#api)
+- [Graduation Criteria](#graduation-criteria)
+- [Testing Plan](#testing-plan)
+  - [Unit Tests](#unit-tests)
+  - [Integration tests](#integration-tests)
+  - [E2E tests](#e2e-tests)
+- [Implementation History](#implementation-history)
+
+## Summary
+
+Kueue reclaims the resources of the pods of a Workload as each Pod completes successfully, instead of waiting for the whole Workload to finish. This benefits pending Workloads that cannot be admitted only because a small amount of resources is still held by almost-finished Workloads.
+
+Hereafter, *pods of a Workload* is used synonymously with *pods of a Job* and vice-versa, because the Job manages the Workload and the pods created by the Job are the pods of the Workload.
+
+## Motivation
+
+Currently, the resources owned by a Job are reclaimed by kueue only when the whole Job finishes. For jobs with multiple pods, the resources will be reclaimed only after the last Pod of the Job finishes. This is not efficient: a parallel Job may have laggard pods, while the resources already freed by its finished pods remain reserved until the whole Job completes.
+
+### Goals
+
+- Utilize the unused resources of Jobs that are running in kueue.
+
+### Non-Goals
+
+- A running Job in the queue should not be preempted.
+- The resources that are being used by a Job should not be allocated to any other Job.
+
+## Proposal
+
+Reclaim the resources of the succeeded pods of a Job as soon as each of them completes its execution.
+
+Add a new field to the status of the Workload to count the successfully completed pods of the Workload. The Job controller monitors the count of pods that have successfully completed their execution and is responsible for updating the status of the Workload object accordingly.
+
+Update our documentation to reflect the new enhancement.
+
+### Risks and Mitigations
+
+There is a change in the functionality of reclaiming the cluster-queue resources. The resources of a Job are reclaimed by kueue in an incremental manner (the resources of each Pod of a Job are reclaimed separately), as opposed to reclaiming all the resources of the Job after its completion.
+
+The update logic that reclaims resources on Pod completion must be handled correctly.
+
+### API
+
+A new field `CompletedPods` is added to the `WorkloadStatus` struct.
+
+```go
+// WorkloadStatus defines the observed state of Workload
+type WorkloadStatus struct {
+	// conditions hold the latest available observations of the Workload
+	// current state.
+	// +optional
+	// +listType=map
+	// +listMapKey=type
+	Conditions []WorkloadCondition `json:"conditions,omitempty"`
+
+	// completedPods is the number of pods that reached phase Succeeded.
+	// +optional
+	CompletedPods int32 `json:"completedPods"`
+}
+```
+
+#### CompletedPods
+
+`CompletedPods` stores the count of pods that have successfully completed their execution.
+
+High-level execution flow:
+1. Compute the number of completed pods of a Job as the count of its Succeeded pods.
+2. Update the Workload status object if the Job's count of Succeeded pods is greater than `wl.Status.CompletedPods`.
+3. Handle the update event of the Workload and update the Workload status in the cache.
+4. Update the clusterQueue quota of the resource flavor for the requests of the reclaimed Pods of the Workload in the cache.
+
+## Graduation Criteria
+
+* The features have been stable and reliable in the past several releases.
+* Adequate documentation exists for the features.
+* Test coverage of the features is acceptable.
+
+## Testing Plan
+
+Dynamically reclaiming resources enhancement has unit and integration tests. These tests are run regularly as a part of kueue's prow CI/CD pipeline.
+
+### Unit Tests
+Here is a list of unit tests for various modules of the feature:
+* [Cluster-Queue cache tests](https://github.com/kubernetes-sigs/kueue/blob/main/pkg/cache/cache_test.go)
+* [Workload tests](https://github.com/kubernetes-sigs/kueue/blob/main/pkg/workload/workload_test.go)
+
+### Integration tests
+* Integration tests for Job controller are [found here](https://github.com/kubernetes-sigs/kueue/blob/main/test/integration/controller/job/job_controller_test.go)
+* Integration tests for Workload controller are [found here](https://github.com/kubernetes-sigs/kueue/blob/main/test/integration/controller/core/workload_controller_test.go)
+
+## Implementation History
+
+Dynamically Reclaiming Resources are tracked as part of [enhancement#78](https://github.com/kubernetes-sigs/kueue/issues/78).
+
+**TODO** - Add proposal link
\ No newline at end of file
diff --git a/enhancements/78-dynamically-reclaiming-resources/kep.yaml b/enhancements/78-dynamically-reclaiming-resources/kep.yaml
new file mode 100644
index 0000000000..8a11ed61ec
--- /dev/null
+++ b/enhancements/78-dynamically-reclaiming-resources/kep.yaml
@@ -0,0 +1,18 @@
+title: Dynamically Reclaiming Resources
+kep-number: 78
+authors:
+  - "@thisisprasad"
+owning-sig: sig-scheduling
+reviewers:
+  - "@alculquicondor"
+  - "@kerthcet"
+approvers:
+  - "@ahg-g"
+  - "@alculquicondor"
+editor: Prasad Saraf
+creation-date: 2022-08-15
+last-updated: 2022-08-15
+status: implementable
+see-also:
+replaces:
+superseded-by:

From 18b8c9b5f2b85f912c61292cb88ec69938744f53 Mon Sep 17 00:00:00 2001
From: Prasad Saraf
Date: Thu, 22 Sep 2022 20:30:35 +0530
Subject: [PATCH 2/5] * address comments 1

---
 .../README.md | 132 ++++++++++--------
 .../kep.yaml  |   2 -
 2 files changed, 76 insertions(+), 58 deletions(-)

diff --git a/enhancements/78-dynamically-reclaiming-resources/README.md b/enhancements/78-dynamically-reclaiming-resources/README.md
index 62f19ef52c..b07271c7ea 100644
--- a/enhancements/78-dynamically-reclaiming-resources/README.md
+++ b/enhancements/78-dynamically-reclaiming-resources/README.md
@@ -8,98 +8,106 @@
- [Goals](#goals)
- [Non-Goals](#non-goals)
- [Proposal](#proposal)
-  - [Risks and Mitigations](#risks-and-mitigations)
+  - [Pod Successful Completion Cases](#pod-successful-completion-cases)
+    - [Job Parallelism Equal To 1](#job-parallelism-equal-to-1)
+    - [Job Parallelism Greater Than 1](#job-parallelism-greater-than-1)
+  - [Pod Failure Cases](#pod-failure-cases)
+    - [RestartPolicy of Pods as Never](#restartpolicy-of-pods-as-never)
+    - [RestartPolicy of Pods as OnFailure](#restartpolicy-of-pods-as-onfailure)
- [API](#api)
-- [Graduation Criteria](#graduation-criteria)
+  - [ReclaimedPodSetPods](#reclaimedpodsetpods)
+- [Implementation](#implementation)
- [Testing Plan](#testing-plan)
  - [Unit Tests](#unit-tests)
  - [Integration tests](#integration-tests)
-  - [E2E tests](#e2e-tests)
- [Implementation History](#implementation-history)

## Summary

-Kueue reclaims the resources of the pods of a Workload as each Pod completes successfully, instead of waiting for the whole Workload to finish. This benefits pending Workloads that cannot be admitted only because a small amount of resources is still held by almost-finished Workloads.
+This proposal allows the resources of the Pods of a Workload to be reclaimed after the successful completion of each of the Pods of the Workload. Freeing the unused resources of a Workload earlier may allow more pending Workloads to be admitted. This will allow Kueue to utilize the existing resources of the cluster much more efficiently.

-Hereafter, *pods of a Workload* is used synonymously with *pods of a Job* and vice-versa, because the Job manages the Workload and the pods created by the Job are the pods of the Workload.
+In the remainder of this document, *Pods of a Workload* means the same as *Pods of a Job* and vice-versa.

## Motivation

-Currently, the resources owned by a Job are reclaimed by kueue only when the whole Job finishes. For jobs with multiple pods, the resources will be reclaimed only after the last Pod of the Job finishes. This is not efficient: a parallel Job may have laggard pods, while the resources already freed by its finished pods remain reserved until the whole Job completes.
+Currently, the resources owned by a Job are reclaimed by kueue only when the whole Job finishes.
For Jobs with multiple Pods, the resources will be reclaimed only after the last Pod of the Job finishes. This is not efficient, as a parallel Job may have laggard Pods, while the resources already freed by its finished Pods remain reserved until the whole Job completes.

### Goals

-- Utilize the unused resources of Jobs that are running in kueue.
+- Utilize the unused resources of the successfully completed Pods of a running Job.

### Non-Goals

-- A running Job in the queue should not be preempted.
-- The resources that are being used by a Job should not be allocated to any other Job.
+- Preempting a Job or Pods of a Job to allocate resources to a pending Workload.
+- Partially admitting a Workload.

## Proposal

-Reclaim the resources of the succeeded pods of a Job as soon as each of them completes its execution.
+Reclaim the resources of the succeeded Pods of a running Job as soon as each Pod completes its execution.

-Add a new field to the status of the Workload to count the successfully completed pods of the Workload. The Job controller monitors the count of pods that have successfully completed their execution and is responsible for updating the status of the Workload object accordingly.
+We propose to add a new field `.status.ReclaimedPodSetPods` to the Workload API. The Workload controller will be responsible for updating the field `.status.ReclaimedPodSetPods` according to the status of the Workload's Pods.

-Update our documentation to reflect the new enhancement.
+`.status.ReclaimedPodSetPods` will keep a mapping from the PodSet name to the count and names of the successfully completed Pods belonging to the Workload. If the name of a Pod of a Workload is present in the field, the resources associated with that completed Pod have been reclaimed. The name of the Pod is recorded to avoid reclaiming the resources of a Pod more than once.

-### Risks and Mitigations
+Now we will look at the Pod failure cases and how the reclaiming of resources by Kueue will happen.

-There is a change in the functionality of reclaiming the cluster-queue resources. The resources of a Job are reclaimed by kueue in an incremental manner (the resources of each Pod of a Job are reclaimed separately), as opposed to reclaiming all the resources of the Job after its completion.
+### Pod Successful Completion Cases

-The update logic that reclaims resources on Pod completion must be handled correctly.
+#### Job Parallelism Equal To 1

-### API
+By default, the `.spec.parallelism` of a Job is equal to `1`. In this case, if the Pod completes with successful execution, the whole Job execution can be considered a success. Hence, the resources associated with the Job will be reclaimed by Kueue after the Job completion. This functionality is present today as well, and no change will be required.

-A new field `CompletedPods` is added to the `WorkloadStatus` struct.
+#### Job Parallelism Greater Than 1
+
+Whenever a Pod of a Job successfully completes its execution, Kueue will reclaim the resources that were associated with the successful Pod. The Job might still be in the `Running` state, as there could be other Pods of the Job that are executing. Thus, Kueue will reclaim the resources of the succeeded Pods of the Job in an incremental manner until the Job completes in either the `Succeeded` or `Failed` state. Whenever the Job completes, the remaining resources owned by the Job will be reclaimed by Kueue.
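+
+To make the incremental accounting concrete, the sketch below shows the kind of comparison the controller could perform on every Job update. It is a rough illustration, not part of the proposed API: it relies only on the standard `.status.succeeded` counter of `batch/v1` Jobs, and `recordedSucceeded`, the number of succeeded Pods already captured in the Workload status, is a hypothetical parameter introduced purely for this example.
+
+```go
+import batchv1 "k8s.io/api/batch/v1"
+
+// newlySucceeded returns how many Pods of the Job finished successfully
+// since the Workload status was last updated; only this excess needs to
+// be reclaimed.
+func newlySucceeded(job *batchv1.Job, recordedSucceeded int32) int32 {
+	if delta := job.Status.Succeeded - recordedSucceeded; delta > 0 {
+		return delta
+	}
+	return 0
+}
+```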
+
+### Pod Failure Cases
+
+#### RestartPolicy of Pods as Never
+
+Whenever a Pod of a Job fails, the Job will create a new Pod as a replacement for the failed Pod. The Job will continue this process of creating new Pods until it reaches its `backoffLimit` (by default, the backoffLimit is `6`). The Pod created to replace a failed Pod will reuse the resources of the failed Pod. Once the Job reaches its `backoffLimit` without reaching the required successful `.spec.completions` count, the Job is termed a `Failed` Job. When the Job is marked as failed, no more Pods will be created by the Job, so the remaining resources owned by the Job will be reclaimed by Kueue.
+
+#### RestartPolicy of Pods as OnFailure
+
+When the Pods of the Job have `.spec.template.spec.restartPolicy = "OnFailure"`, a failed Pod stays on the node and the failed container within the Pod is restarted. The Pod might run indefinitely, or get marked as `Failed` once the Job exceeds its `.spec.activeDeadlineSeconds`, if specified. In this case as well, Kueue won't reclaim the resources of the Job until the Job completes as either `Failed` or `Succeeded`.
+
+Hence, as seen from the Pod failure cases discussed above, we conclude that the resources of Pods of a Job that fail during execution are not immediately reclaimed by Kueue. Only when the Job is marked as `Failed` will the resources of the failed Pods be reclaimed by Kueue.
+
+## API
+
+A new field `ReclaimedPodSetPods` is added to the `.status` of the Workload API.

```go
// WorkloadStatus defines the observed state of Workload
type WorkloadStatus struct {
	// conditions hold the latest available observations of the Workload
	// current state.
	// +optional
	// +listType=map
	// +listMapKey=type
	Conditions []WorkloadCondition `json:"conditions,omitempty"`

-	// completedPods is the number of pods that reached phase Succeeded.
-	// +optional
-	CompletedPods int32 `json:"completedPods"`
+	// ReclaimedPodSetPods are the pods (by PodSet name) which have
+	// completed their execution successfully and whose resources
+	// have been reclaimed.
+	// +optional
+	ReclaimedPodSetPods ReclaimedPodSetPods `json:"reclaimedPodSetPods"`
}
+
+type ReclaimedPodSetPods map[string]struct {
+	Count int
+	Pods  []string
+}
```

-#### CompletedPods
+#### ReclaimedPodSetPods

-`CompletedPods` stores the count of pods that have successfully completed their execution.
-
-High-level execution flow:
-1. Compute the number of completed pods of a Job as the count of its Succeeded pods.
-2. Update the Workload status object if the Job's count of Succeeded pods is greater than `wl.Status.CompletedPods`.
-3. Handle the update event of the Workload and update the Workload status in the cache.
-4. Update the clusterQueue quota of the resource flavor for the requests of the reclaimed Pods of the Workload in the cache.
+`.status.reclaimedPodSetPods` holds, for each PodSet name, the count and the names of the Pods that have successfully completed their execution and whose resources have been reclaimed by Kueue. The names of the Pods are recorded to identify the Pods whose resources have been reclaimed. If the name of a Pod already exists in the list, the resources associated with that succeeded Pod have already been reclaimed by Kueue.

-## Graduation Criteria
+## Implementation

-* The features have been stable and reliable in the past several releases.
-* Adequate documentation exists for the features.
-* Test coverage of the features is acceptable.
+The Workload reconciler will keep a watch (or `Watches()`) on the Pods of the Job. The Workload reconciler will process the events of the Pods and update the `.status.reclaimedPodSetPods` field accordingly.

## Testing Plan

@@ -107,13 +115,25 @@
Dynamically reclaiming resources enhancement has unit and integration tests. These tests are run regularly as a part of kueue's prow CI/CD pipeline.

### Unit Tests
-Here is a list of unit tests for various modules of the feature:
+
+All of Kueue's core components must be covered by unit tests. Here is a list of unit tests required for the modules of the feature:
+
* [Cluster-Queue cache tests](https://github.com/kubernetes-sigs/kueue/blob/main/pkg/cache/cache_test.go)
* [Workload tests](https://github.com/kubernetes-sigs/kueue/blob/main/pkg/workload/workload_test.go)

### Integration tests
-* Integration tests for Job controller are [found here](https://github.com/kubernetes-sigs/kueue/blob/main/test/integration/controller/job/job_controller_test.go)
-* Integration tests for Workload controller are [found here](https://github.com/kubernetes-sigs/kueue/blob/main/test/integration/controller/core/workload_controller_test.go)
+* Kueue Job Controller
+  - checking the resources owned by a Job are released to the cache and clusterQueue when a Pod of the Job succeeds.
+  - Integration tests for Job controller are [found here](https://github.com/kubernetes-sigs/kueue/blob/main/test/integration/controller/job/job_controller_test.go).
+
+* Workload Controller
+  - A pending Workload should be admitted when enough resources are available after the release of resources by the succeeded Pods of parallel Jobs.
+  - Should update the `.status.reclaimedPodSetPods` of a Workload when a Pod of a Job succeeds.
+  - Integration tests for Workload Controller are [found here](https://github.com/kubernetes-sigs/kueue/blob/main/test/integration/controller/core/workload_controller_test.go).
+
+* Scheduler
+  - Checking if a Workload gets admitted when an active parallel Job releases resources of completed Pods.
+  - Integration tests for Scheduler are [found here](https://github.com/kubernetes-sigs/kueue/blob/main/test/integration/scheduler/scheduler_test.go).
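+
+As an illustration, the Job Controller check above could take roughly the following shape in the existing Ginkgo suites. The fixtures `createdJob`, `clusterQueue` and `onePodResources`, as well as the helper `expectQuotaReleased`, are hypothetical placeholders for this enhancement, not existing test utilities:
+
+```go
+ginkgo.It("should release the quota of a pod that succeeds mid-run", func() {
+	// Simulate one pod of the admitted job finishing successfully.
+	createdJob.Status.Succeeded = 1
+	gomega.Expect(k8sClient.Status().Update(ctx, createdJob)).To(gomega.Succeed())
+
+	// The cluster queue usage should eventually drop by one pod's requests.
+	expectQuotaReleased(ctx, clusterQueue, onePodResources)
+})
+```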
## Implementation History

diff --git a/enhancements/78-dynamically-reclaiming-resources/kep.yaml b/enhancements/78-dynamically-reclaiming-resources/kep.yaml
index 8a11ed61ec..c6ca117362 100644
--- a/enhancements/78-dynamically-reclaiming-resources/kep.yaml
+++ b/enhancements/78-dynamically-reclaiming-resources/kep.yaml
@@ -10,8 +10,6 @@ approvers:
  - "@ahg-g"
  - "@alculquicondor"
editor: Prasad Saraf
-creation-date: 2022-08-15
-last-updated: 2022-08-15
status: implementable
see-also:
replaces:

From b9e0ccada36ceaa155eb4010a10f260f00eeabbf Mon Sep 17 00:00:00 2001
From: Prasad Saraf
Date: Tue, 11 Oct 2022 11:18:53 +0530
Subject: [PATCH 3/5] * address comments 2

---
 .../README.md | 142 ++++++++++++++++++
 .../kep.yaml  |  15 ++
 2 files changed, 157 insertions(+)
 create mode 100644 keps/78-dynamically-reclaiming-resources/README.md
 create mode 100644 keps/78-dynamically-reclaiming-resources/kep.yaml

diff --git a/keps/78-dynamically-reclaiming-resources/README.md b/keps/78-dynamically-reclaiming-resources/README.md
new file mode 100644
index 0000000000..1bbe55d1fa
--- /dev/null
+++ b/keps/78-dynamically-reclaiming-resources/README.md
@@ -0,0 +1,142 @@
+# Dynamically reclaim resources of Pods of a Workload
+
+## Table of Contents
+
+- [Summary](#summary)
+- [Motivation](#motivation)
+  - [Goals](#goals)
+  - [Non-Goals](#non-goals)
+- [Proposal](#proposal)
+  - [Pod Successful Completion Cases](#pod-successful-completion-cases)
+    - [Job Parallelism Equal To 1](#job-parallelism-equal-to-1)
+    - [Job Parallelism Greater Than 1](#job-parallelism-greater-than-1)
+  - [Pod Failure Cases](#pod-failure-cases)
+    - [RestartPolicy of Pods as Never](#restartpolicy-of-pods-as-never)
+    - [RestartPolicy of Pods as OnFailure](#restartpolicy-of-pods-as-onfailure)
+- [API](#api)
+  - [ReclaimedPodSetPods](#reclaimedpodsetpods)
+- [Implementation](#implementation)
+- [Testing Plan](#testing-plan)
+  - [Unit Tests](#unit-tests)
+  - [Integration tests](#integration-tests)
+- [Implementation History](#implementation-history)
+
+## Summary
+
+This proposal allows Kueue to reclaim the resources of a successfully completed Pod of a Workload even before the whole Workload completes its execution. Freeing the unused resources of a Workload earlier may allow more pending Workloads to be admitted. This will allow Kueue to utilize the existing resources of the cluster much more efficiently.
+
+In the remainder of this document, *Pods of a Workload* means the same as *Pods of a Job* and vice-versa.
+
+## Motivation
+
+Currently, the resources owned by a Job are reclaimed by Kueue only when the whole Job finishes. For Jobs with multiple Pods, the resources will be reclaimed only after the last Pod of the Job finishes. This is not efficient, as a parallel Job may have laggard Pods, while the resources already freed by its finished Pods remain reserved until the whole Job completes.
+
+### Goals
+
+- Utilize the unused resources of the successfully completed Pods of a running Job.
+
+### Non-Goals
+
+- Preempting a Workload or Pods of a Workload to free resources for a pending Workload.
+- Partially admitting a Workload.
+
+## Proposal
+
+Reclaim the resources of the succeeded Pods of a running Job as soon as each Pod completes its execution.
+
+We propose to add a new field `.status.ReclaimedPodSetPods` to the Workload API. The Workload controller will be responsible for updating the field `.status.ReclaimedPodSetPods` according to the status of the Workload's Pods.
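+
+For illustration only, a Workload whose `main` PodSet already had two Pods reclaimed could carry a status entry as sketched below, using the `ReclaimedPodSetPods` type defined in the [API](#api) section; `wl` stands for the Workload object being reconciled, and the name and count are example values, not part of the proposal:
+
+```go
+// Example only: record that two Pods of the "main" PodSet have had
+// their resources reclaimed.
+wl.Status.ReclaimedPodSetPods = []ReclaimedPodSetPods{
+	{Name: "main", Count: 2},
+}
+```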
+
+`.status.ReclaimedPodSetPods` is a list that holds, per PodSet, the count of successfully completed Pods whose resources have been reclaimed by Kueue.
+
+Now we will look at the Pod failure cases and how the reclaiming of resources by Kueue will happen.
+
+### Pod Successful Completion Cases
+
+#### Job Parallelism Equal To 1
+
+By default, the `.spec.parallelism` of a Job is equal to `1`. In this case, if the Pod completes with successful execution, the whole Job execution can be considered a success. Hence, the resources associated with the Job will be reclaimed by Kueue after the Job completion. This functionality is present today as well, and no change will be required.
+
+#### Job Parallelism Greater Than 1
+
+Whenever a Pod of a Job successfully completes its execution, Kueue will reclaim the resources that were associated with the successful Pod. The Job might still be in the `Running` state, as there could be other Pods of the Job that are executing. Thus, Kueue will reclaim the resources of the succeeded Pods of the Job in an incremental manner until the Job completes in either the `Succeeded` or `Failed` state. Whenever the Job completes, the remaining resources owned by the Job will be reclaimed by Kueue.
+
+### Pod Failure Cases
+
+#### RestartPolicy of Pods as Never
+
+Whenever a Pod of a Job fails, the Job will create a new Pod as a replacement for the failed Pod. The Job will continue this process of creating new Pods until it reaches its `backoffLimit` (by default, the backoffLimit is `6`). The Pod created to replace a failed Pod will reuse the resources of the failed Pod. Once the Job reaches its `backoffLimit` without reaching the required successful `.spec.completions` count, the Job is termed a `Failed` Job. When the Job is marked as failed, no more Pods will be created by the Job, so the remaining resources owned by the Job will be reclaimed by Kueue.
+
+#### RestartPolicy of Pods as OnFailure
+
+When the Pods of the Job have `.spec.template.spec.restartPolicy = "OnFailure"`, a failed Pod stays on the node and the failed container within the Pod is restarted. The Pod might run indefinitely, or get marked as `Failed` once the Job exceeds its `.spec.activeDeadlineSeconds`, if specified. In this case as well, Kueue won't reclaim the resources of the Job until the Job completes as either `Failed` or `Succeeded`.
+
+Hence, as seen from the Pod failure cases discussed above, we conclude that the resources of Pods of a Job that fail during execution are not immediately reclaimed by Kueue. Only when the Job is marked as `Failed` will the resources of the failed Pods be reclaimed by Kueue.
+
+## API
+
+A new field `ReclaimedPodSetPods` is added to the `.status` of the Workload API.
+
+```go
+// WorkloadStatus defines the observed state of Workload
+type WorkloadStatus struct {
+	// conditions hold the latest available observations of the Workload
+	// current state.
+	// +optional
+	// +listType=map
+	// +listMapKey=type
+	Conditions []WorkloadCondition `json:"conditions,omitempty"`
+
+	// reclaimedPodSetPods is the list of counts of Pods per PodSet
+	// whose resources have been reclaimed.
+	// +optional
+	ReclaimedPodSetPods []ReclaimedPodSetPods `json:"reclaimedPodSetPods"`
+}
+
+// ReclaimedPodSetPods defines the PodSet name and the count of successfully
+// completed Pods belonging to the PodSet whose resources have been reclaimed
+// by Kueue.
+type ReclaimedPodSetPods struct {
+	Name  string
+	Count int
+}
+```
+
+#### ReclaimedPodSetPods
+`.status.reclaimedPodSetPods` is a list where each element denotes, for one PodSet, the count of Pods of the Workload whose resources have been reclaimed by Kueue. Each element consists of two fields: `Name`, the name of the PodSet, and `Count`, the count of reclaimed Pods belonging to that PodSet.
+
+## Implementation
+
+The Workload reconciler will keep a watch (or `Watches()`) on the Job's `.status.succeeded` field. The Workload reconciler will calculate the difference between the Job's `.status.succeeded` field and the Workload's `.status.reclaimedPodSetPods[i].count` field for every PodSet of the Workload. When the former value is greater than the latter, Kueue will reclaim the resources of the excess succeeded Pods.
+
+## Testing Plan
+
+Dynamically reclaiming resources enhancement has unit and integration tests. These tests are run regularly as a part of Kueue's prow CI/CD pipeline.
+
+### Unit Tests
+
+All of Kueue's core components must be covered by unit tests. Here is a list of unit tests required for the modules of the feature:
+
+* [Cluster-Queue cache tests](https://github.com/kubernetes-sigs/kueue/blob/main/pkg/cache/cache_test.go)
+* [Workload tests](https://github.com/kubernetes-sigs/kueue/blob/main/pkg/workload/workload_test.go)
+
+### Integration tests
+* Kueue Job Controller
+  - checking the resources owned by a Job are released to the cache and clusterQueue when a Pod of the Job succeeds.
+  - Integration tests for Job controller are [found here](https://github.com/kubernetes-sigs/kueue/blob/main/test/integration/controller/job/job_controller_test.go).
+
+* Workload Controller
+  - A pending Workload should be admitted when enough resources are available after the release of resources by the succeeded Pods of parallel Jobs.
+  - Should update the `.status.reclaimedPodSetPods` of a Workload when a Pod of a Job succeeds.
+  - Integration tests for Workload Controller are [found here](https://github.com/kubernetes-sigs/kueue/blob/main/test/integration/controller/core/workload_controller_test.go).
+
+* Scheduler
+  - Checking if a Workload gets admitted when an active parallel Job releases resources of completed Pods.
+  - Integration tests for Scheduler are [found here](https://github.com/kubernetes-sigs/kueue/blob/main/test/integration/scheduler/scheduler_test.go).
+
+## Implementation History
+
+Dynamically Reclaiming Resources are tracked as part of [enhancement#78](https://github.com/kubernetes-sigs/kueue/issues/78).
+
+**TODO** - Add proposal link
\ No newline at end of file
diff --git a/keps/78-dynamically-reclaiming-resources/kep.yaml b/keps/78-dynamically-reclaiming-resources/kep.yaml
new file mode 100644
index 0000000000..ce931ad1e5
--- /dev/null
+++ b/keps/78-dynamically-reclaiming-resources/kep.yaml
@@ -0,0 +1,15 @@
+title: Dynamically Reclaiming Resources
+kep-number: 78
+authors:
+  - "@thisisprasad"
+owning-sig: sig-scheduling
+reviewers:
+  - "@alculquicondor"
+  - "@kerthcet"
+approvers:
+  - "@ahg-g"
+  - "@alculquicondor"
+status: implementable
+see-also:
+replaces:
+superseded-by:

From 2a62fcb2c9c9c71b5247c150c585e9082dfd5d96 Mon Sep 17 00:00:00 2001
From: Prasad Saraf
Date: Sat, 19 Nov 2022 19:56:18 +0530
Subject: [PATCH 4/5] Remove enhancement directory and address comments

---
 .../README.md | 142 ------------------
 .../kep.yaml  |  16 --
 .../README.md |  49 +++---
 3 files changed, 22 insertions(+), 185 deletions(-)
 delete mode 100644 enhancements/78-dynamically-reclaiming-resources/README.md
 delete mode 100644 enhancements/78-dynamically-reclaiming-resources/kep.yaml

diff --git a/enhancements/78-dynamically-reclaiming-resources/README.md b/enhancements/78-dynamically-reclaiming-resources/README.md
deleted file mode 100644
index b07271c7ea..0000000000
--- a/enhancements/78-dynamically-reclaiming-resources/README.md
+++ /dev/null
@@ -1,142 +0,0 @@
-# Dynamically reclaim resources of Pods of a Workload
-
-## Table of Contents
-
-- [Summary](#summary)
-- [Motivation](#motivation)
-  - [Goals](#goals)
-  - [Non-Goals](#non-goals)
-- [Proposal](#proposal)
-  - [Pod Successful Completion Cases](#pod-successful-completion-cases)
-    - [Job Parallelism Equal To 1](#job-parallelism-equal-to-1)
-    - [Job Parallelism Greater Than 1](#job-parallelism-greater-than-1)
-  - [Pod Failure Cases](#pod-failure-cases)
-    - [RestartPolicy of Pods as Never](#restartpolicy-of-pods-as-never)
-    - [RestartPolicy of Pods as OnFailure](#restartpolicy-of-pods-as-onfailure)
-- [API](#api)
-  - [ReclaimedPodSetPods](#reclaimedpodsetpods)
-- [Implementation](#implementation)
-- [Testing Plan](#testing-plan)
-  - [Unit Tests](#unit-tests)
-  - [Integration tests](#integration-tests)
-- [Implementation History](#implementation-history)
-
-## Summary
-
-This proposal allows the resources of the Pods of a Workload to be reclaimed after the successful completion of each of the Pods of the Workload. Freeing the unused resources of a Workload earlier may allow more pending Workloads to be admitted. This will allow Kueue to utilize the existing resources of the cluster much more efficiently.
-
-In the remainder of this document, *Pods of a Workload* means the same as *Pods of a Job* and vice-versa.
-
-## Motivation
-
-Currently, the resources owned by a Job are reclaimed by kueue only when the whole Job finishes.
-For Jobs with multiple Pods, the resources will be reclaimed only after the last Pod of the Job finishes. This is not efficient, as a parallel Job may have laggard Pods, while the resources already freed by its finished Pods remain reserved until the whole Job completes.
-
-### Goals
-
-- Utilize the unused resources of the successfully completed Pods of a running Job.
-
-### Non-Goals
-
-- Preempting a Job or Pods of a Job to allocate resources to a pending Workload.
-- Partially admitting a Workload.
-
-## Proposal
-
-Reclaim the resources of the succeeded Pods of a running Job as soon as each Pod completes its execution.
-
-We propose to add a new field `.status.ReclaimedPodSetPods` to the Workload API.
The Workload controller will be responsible for updating the field `.status.ReclaimedPodSetPods` according to the status of the Workload's Pods.
-
-`.status.ReclaimedPodSetPods` will keep a mapping from the PodSet name to the count and names of the successfully completed Pods belonging to the Workload. If the name of a Pod of a Workload is present in the field, the resources associated with that completed Pod have been reclaimed. The name of the Pod is recorded to avoid reclaiming the resources of a Pod more than once.
-
-Now we will look at the Pod failure cases and how the reclaiming of resources by Kueue will happen.
-
-### Pod Successful Completion Cases
-
-#### Job Parallelism Equal To 1
-
-By default, the `.spec.parallelism` of a Job is equal to `1`. In this case, if the Pod completes with successful execution, the whole Job execution can be considered a success. Hence, the resources associated with the Job will be reclaimed by Kueue after the Job completion. This functionality is present today as well, and no change will be required.
-
-#### Job Parallelism Greater Than 1
-
-Whenever a Pod of a Job successfully completes its execution, Kueue will reclaim the resources that were associated with the successful Pod. The Job might still be in the `Running` state, as there could be other Pods of the Job that are executing. Thus, Kueue will reclaim the resources of the succeeded Pods of the Job in an incremental manner until the Job completes in either the `Succeeded` or `Failed` state. Whenever the Job completes, the remaining resources owned by the Job will be reclaimed by Kueue.
-
-### Pod Failure Cases
-
-#### RestartPolicy of Pods as Never
-
-Whenever a Pod of a Job fails, the Job will create a new Pod as a replacement for the failed Pod. The Job will continue this process of creating new Pods until it reaches its `backoffLimit` (by default, the backoffLimit is `6`). The Pod created to replace a failed Pod will reuse the resources of the failed Pod. Once the Job reaches its `backoffLimit` without reaching the required successful `.spec.completions` count, the Job is termed a `Failed` Job. When the Job is marked as failed, no more Pods will be created by the Job, so the remaining resources owned by the Job will be reclaimed by Kueue.
-
-#### RestartPolicy of Pods as OnFailure
-
-When the Pods of the Job have `.spec.template.spec.restartPolicy = "OnFailure"`, a failed Pod stays on the node and the failed container within the Pod is restarted. The Pod might run indefinitely, or get marked as `Failed` once the Job exceeds its `.spec.activeDeadlineSeconds`, if specified. In this case as well, Kueue won't reclaim the resources of the Job until the Job completes as either `Failed` or `Succeeded`.
-
-Hence, as seen from the Pod failure cases discussed above, we conclude that the resources of Pods of a Job that fail during execution are not immediately reclaimed by Kueue. Only when the Job is marked as `Failed` will the resources of the failed Pods be reclaimed by Kueue.
-
-## API
-
-A new field `ReclaimedPodSetPods` is added to the `.status` of the Workload API.
-
-```go
-// WorkloadStatus defines the observed state of Workload
-type WorkloadStatus struct {
-	// conditions hold the latest available observations of the Workload
-	// current state.
-	// +optional
-	// +listType=map
-	// +listMapKey=type
-	Conditions []WorkloadCondition `json:"conditions,omitempty"`
-
-	// ReclaimedPodSetPods are the pods (by PodSet name) which have
-	// completed their execution successfully and whose resources
-	// have been reclaimed.
-	// +optional
-	ReclaimedPodSetPods ReclaimedPodSetPods `json:"reclaimedPodSetPods"`
-}
-
-type ReclaimedPodSetPods map[string]struct {
-	Count int
-	Pods  []string
-}
-```
-
-#### ReclaimedPodSetPods
-
-`.status.reclaimedPodSetPods` holds, for each PodSet name, the count and the names of the Pods that have successfully completed their execution and whose resources have been reclaimed by Kueue. The names of the Pods are recorded to identify the Pods whose resources have been reclaimed. If the name of a Pod already exists in the list, the resources associated with that succeeded Pod have already been reclaimed by Kueue.
-
-## Implementation
-
-The Workload reconciler will keep a watch (or `Watches()`) on the Pods of the Job. The Workload reconciler will process the events of the Pods and update the `.status.reclaimedPodSetPods` field accordingly.
-
-## Testing Plan
-
-Dynamically reclaiming resources enhancement has unit and integration tests. These tests are run regularly as a part of kueue's prow CI/CD pipeline.
-
-### Unit Tests
-
-All of Kueue's core components must be covered by unit tests. Here is a list of unit tests required for the modules of the feature:
-
-* [Cluster-Queue cache tests](https://github.com/kubernetes-sigs/kueue/blob/main/pkg/cache/cache_test.go)
-* [Workload tests](https://github.com/kubernetes-sigs/kueue/blob/main/pkg/workload/workload_test.go)
-
-### Integration tests
-* Kueue Job Controller
-  - checking the resources owned by a Job are released to the cache and clusterQueue when a Pod of the Job succeeds.
-  - Integration tests for Job controller are [found here](https://github.com/kubernetes-sigs/kueue/blob/main/test/integration/controller/job/job_controller_test.go).
-
-* Workload Controller
-  - A pending Workload should be admitted when enough resources are available after the release of resources by the succeeded Pods of parallel Jobs.
-  - Should update the `.status.reclaimedPodSetPods` of a Workload when a Pod of a Job succeeds.
-  - Integration tests for Workload Controller are [found here](https://github.com/kubernetes-sigs/kueue/blob/main/test/integration/controller/core/workload_controller_test.go).
-
-* Scheduler
-  - Checking if a Workload gets admitted when an active parallel Job releases resources of completed Pods.
-  - Integration tests for Scheduler are [found here](https://github.com/kubernetes-sigs/kueue/blob/main/test/integration/scheduler/scheduler_test.go).
-
-## Implementation History
-
-Dynamically Reclaiming Resources are tracked as part of [enhancement#78](https://github.com/kubernetes-sigs/kueue/issues/78).
-
-**TODO** - Add proposal link
\ No newline at end of file
diff --git a/enhancements/78-dynamically-reclaiming-resources/kep.yaml b/enhancements/78-dynamically-reclaiming-resources/kep.yaml
deleted file mode 100644
index c6ca117362..0000000000
--- a/enhancements/78-dynamically-reclaiming-resources/kep.yaml
+++ /dev/null
@@ -1,16 +0,0 @@
-title: Dynamically Reclaiming Resources
-kep-number: 78
-authors:
-  - "@thisisprasad"
-owning-sig: sig-scheduling
-reviewers:
-  - "@alculquicondor"
-  - "@kerthcet"
-approvers:
-  - "@ahg-g"
-  - "@alculquicondor"
-editor: Prasad Saraf
-status: implementable
-see-also:
-replaces:
-superseded-by:
diff --git a/keps/78-dynamically-reclaiming-resources/README.md b/keps/78-dynamically-reclaiming-resources/README.md
index 1bbe55d1fa..21c5f4c6b5 100644
--- a/keps/78-dynamically-reclaiming-resources/README.md
+++ b/keps/78-dynamically-reclaiming-resources/README.md
@@ -1,4 +1,4 @@
-# Dynamically reclaim resources of Pods of a Workload
+# KEP-78: Dynamically reclaim resources of Pods of a Workload

## Table of Contents

@@ -9,13 +9,12 @@
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [Pod Successful Completion Cases](#pod-successful-completion-cases)
-    - [Job Parallelism Equal To 1](#job-parallelism-equal-to-1)
-    - [Job Parallelism Greater Than 1](#job-parallelism-greater-than-1)
+    - [Case N >= M](#case-n--m)
+    - [Case M > N](#case-m--n)
  - [Pod Failure Cases](#pod-failure-cases)
    - [RestartPolicy of Pods as Never](#restartpolicy-of-pods-as-never)
    - [RestartPolicy of Pods as OnFailure](#restartpolicy-of-pods-as-onfailure)
- [API](#api)
-  - [ReclaimedPodSetPods](#reclaimedpodsetpods)
- [Implementation](#implementation)
- [Testing Plan](#testing-plan)
  - [Unit Tests](#unit-tests)
@@ -47,21 +46,23 @@

## Proposal

Reclaim the resources of the succeeded Pods of a running Job as soon as each Pod completes its execution.

-We propose to add a new field `.status.ReclaimedPodSetPods` to the Workload API. The Workload controller will be responsible for updating the field `.status.ReclaimedPodSetPods` according to the status of the Workload's Pods.
+We propose to add a new field `.status.ReclaimedPodSets` to the Workload API. `.status.ReclaimedPodSets` is a list that holds, per PodSet, the count of successfully completed Pods whose resources have been reclaimed by Kueue.

-`.status.ReclaimedPodSetPods` is a list that holds, per PodSet, the count of successfully completed Pods whose resources have been reclaimed by Kueue.
-
-Now we will look at the Pod failure cases and how the reclaiming of resources by Kueue will happen.
+Now we will look at the Pod successful completion and failure cases and how Kueue will reclaim the resources in each of the cases.

### Pod Successful Completion Cases

-#### Job Parallelism Equal To 1
+Reclaiming the resources of a successful Pod of a Job depends on two parameters: the remaining completions of a Job (**M**) and the Job's parallelism (**N**). Depending on the values of **M** and **N**, we will look at the different cases of how the resources will be reclaimed by Kueue.

-By default, the `.spec.parallelism` of a Job is equal to `1`. In this case, if the Pod completes with successful execution, the whole Job execution can be considered a success. Hence, the resources associated with the Job will be reclaimed by Kueue after the Job completion. This functionality is present today as well, and no change will be required.
+Please note that **M** refers to the remaining successful completions of a Job and not the Job's `.spec.completions` field. **M** is subject to change during the lifetime of the Job. One way to derive the value of **M** would be to calculate the difference of the `.spec.completions` and `.status.succeeded` values of a Job.

-#### Job Parallelism Greater Than 1
+#### Case N >= M
+If a Job's parallelism is greater than or equal to its remaining completions, then for every successful completion of a Pod of the Job, Kueue will reclaim the resources associated with the successful Pod.

-Whenever a Pod of a Job successfully completes its execution, Kueue will reclaim the resources that were associated with the successful Pod. The Job might still be in the `Running` state, as there could be other Pods of the Job that are executing. Thus, Kueue will reclaim the resources of the succeeded Pods of the Job in an incremental manner until the Job completes in either the `Succeeded` or `Failed` state. Whenever the Job completes, the remaining resources owned by the Job will be reclaimed by Kueue.
+#### Case M > N
+When the remaining completions of a Job are greater than the Job's parallelism, then for every successfully completed Pod of the Job, the resources associated with the Pod won't be reclaimed by Kueue. This is because the resource requirement of the Job still remains the same: the Job has to create a new Pod as a replacement for the successfully completed Pod.
+
+The Job will be able to reclaim the resources of Pods once it satisfies the case `N >= M`. A Job that proceeds in its execution under the case `M > N` will eventually convert to the case `N >= M`, because the value of **M** decreases with successful Pod completions. Hence, the process to reclaim the resources of a Pod of a Job will be the same as described for the case `N >= M`.

### Pod Failure Cases

@@ -77,7 +78,7 @@

## API

-A new field `ReclaimedPodSetPods` is added to the `.status` of the Workload API.
+A new field `ReclaimedPodSets` is added to the `.status` of the Workload API.

@@ -91,23 +92,20 @@
	// list of counts of Pods per PodSet whose resources have been reclaimed
	// +optional
-	ReclaimedPodSetPods []ReclaimedPodSetPods `json:"reclaimedPodSetPods"`
+	ReclaimedPodSets []ReclaimedPodSet `json:"reclaimedPodSets"`
}

-// ReclaimedPodSetPods defines the PodSet name and the count of successfully
-// completed Pods belonging to the PodSet whose resources have been reclaimed
-// by Kueue.
-type ReclaimedPodSetPods struct {
+// ReclaimedPodSet defines the PodSet name and the count of successfully
+// completed Pods belonging to the PodSet whose resources can be reclaimed.
+type ReclaimedPodSet struct {
	Name  string
	Count int
}
```

-#### ReclaimedPodSetPods
-`.status.reclaimedPodSetPods` is a list where each element denotes, for one PodSet, the count of Pods of the Workload whose resources have been reclaimed by Kueue. Each element consists of two fields: `Name`, the name of the PodSet, and `Count`, the count of reclaimed Pods belonging to that PodSet.
-
## Implementation

-The Workload reconciler will keep a watch (or `Watches()`) on the Job's `.status.succeeded` field. The Workload reconciler will calculate the difference between the Job's `.status.succeeded` field and the Workload's `.status.reclaimedPodSetPods[i].count` field for every PodSet of the Workload. When the former value is greater than the latter, Kueue will reclaim the resources of the excess succeeded Pods.
+Kueue's Job reconciler will compare the Job's `.status.succeeded` field and the Workload's `.status.reclaimedPodSets[i].count` field value. If the former value is greater than the latter, Kueue's Job reconciler will update the Workload object's `.status`. The Workload reconciler will catch the update event and release the resources of the newly succeeded Pods to the ClusterQueue, depending on which of the cases discussed in the [successful Pod completion](#pod-successful-completion-cases) section the Job satisfies.

@@ -123,12 +121,11 @@
### Integration tests
* Kueue Job Controller
-  - checking the resources owned by a Job are released to the cache and clusterQueue when a Pod of the Job succeeds.
+  - Checking the resources owned by a Job are released to the cache and clusterQueue when a Pod of the Job succeeds.
  - Integration tests for Job controller are [found here](https://github.com/kubernetes-sigs/kueue/blob/main/test/integration/controller/job/job_controller_test.go).

* Workload Controller
-  - A pending Workload should be admitted when enough resources are available after the release of resources by the succeeded Pods of parallel Jobs.
-  - Should update the `.status.reclaimedPodSetPods` of a Workload when a Pod of a Job succeeds.
+  - Should update the `.status.reclaimedPodSets` of a Workload when a Pod of a Job succeeds.
  - Integration tests for Workload Controller are [found here](https://github.com/kubernetes-sigs/kueue/blob/main/test/integration/controller/core/workload_controller_test.go).

* Scheduler

@@ -137,6 +134,4 @@
## Implementation History

-Dynamically Reclaiming Resources are tracked as part of [enhancement#78](https://github.com/kubernetes-sigs/kueue/issues/78).
-
-**TODO** - Add proposal link
\ No newline at end of file
+Dynamically Reclaiming Resources are tracked as part of [enhancement#78](https://github.com/kubernetes-sigs/kueue/issues/78).
\ No newline at end of file

From beaceb8166fe94a354fb82439c4513183965bfd3 Mon Sep 17 00:00:00 2001
From: Traian Schiau
Date: Wed, 3 May 2023 17:58:59 +0300
Subject: [PATCH 5/5] Adapt to the current state of Kueue

---
 .../README.md | 151 ++++++++++--------
 .../kep.yaml  |   1 +
 2 files changed, 89 insertions(+), 63 deletions(-)

diff --git a/keps/78-dynamically-reclaiming-resources/README.md b/keps/78-dynamically-reclaiming-resources/README.md
index 21c5f4c6b5..ae618b7c7d 100644
--- a/keps/78-dynamically-reclaiming-resources/README.md
+++ b/keps/78-dynamically-reclaiming-resources/README.md
@@ -8,15 +8,18 @@
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
-  - [Pod Successful Completion Cases](#pod-successful-completion-cases)
-    - [Case N >= M](#case-n--m)
-    - [Case M > N](#case-m--n)
-  - [Pod Failure Cases](#pod-failure-cases)
-    - [RestartPolicy of Pods as Never](#restartpolicy-of-pods-as-never)
-    - [RestartPolicy of Pods as OnFailure](#restartpolicy-of-pods-as-onfailure)
-- [API](#api)
+  - [Pod Completion Accounting](#pod-completion-accounting)
+    - [Reference design for (batch/Job)](#reference-design-for-)
+    - [To consider](#to-consider)
+  - [API](#api)
- [Implementation](#implementation)
+  - [Workload](#workload)
+    - [API](#api-1)
+    - [pkg/workload](#)
+  - [Jobframework](#jobframework)
+  - [Batch/Job](#batchjob)
- [Testing Plan](#testing-plan)
+  - [NonRegression](#nonregression)
  - [Unit Tests](#unit-tests)
  - [Integration tests](#integration-tests)
- [Implementation History](#implementation-history)

@@ -26,12 +29,16 @@

This proposal allows Kueue to reclaim the resources of a successfully completed Pod of a Workload even before the whole Workload completes its execution. Freeing the unused resources of a Workload earlier may allow more pending Workloads to be admitted. This will allow Kueue to utilize the existing resources of the cluster much more efficiently.

-In the remainder of this document, *Pods of a Workload* means the same as *Pods of a Job* and vice-versa.
+In the remainder of this document:
+1. *Job* refers to any kind of job supported by Kueue, including `batch/job`, `MPIJob`, `RayJob`, etc.
+2. *Pods of a Workload* means the same as *Pods of a Job* and vice-versa.

## Motivation

-Currently, the resources owned by a Job are reclaimed by Kueue only when the whole Job finishes. For Jobs with multiple Pods, the resources will be reclaimed only after the last Pod of the Job finishes. This is not efficient, as a parallel Job may have laggard Pods, while the resources already freed by its finished Pods remain reserved until the whole Job completes.
+Currently, the quota assigned to a Job is reclaimed by Kueue only when the whole Job finishes.
+For Jobs with multiple Pods, the resources will be reclaimed only after the last Pod of the Job finishes.
+This is not efficient, as Jobs might have different needs during execution; for instance, the needs of a `batch/job` with `parallelism` equal to `completions` will decrease with every Pod finishing its execution.

### Goals

@@ -41,97 +48,115 @@

### Non-Goals

- Preempting a Workload or Pods of a Workload to free resources for a pending Workload.
- Partially admitting a Workload.
+- Monitoring the Job's pod execution.

## Proposal

Reclaim the resources of the succeeded Pods of a running Job as soon as each Pod completes its execution.

-We propose to add a new field `.status.ReclaimedPodSets` to the Workload API.
`.status.ReclaimedPodSets` is a list that holds, per PodSet, the count of successfully completed Pods whose resources have been reclaimed by Kueue.
+We propose to add a new field `.status.ReclaimablePods` to the Workload API. `.status.ReclaimablePods` is a list that holds the count of Pods belonging to a PodSet whose resources are no longer needed and could be reclaimed by Kueue.

-Now we will look at the Pod successful completion and failure cases and how Kueue will reclaim the resources in each of the cases.
+### Pod Completion Accounting

-### Pod Successful Completion Cases
+Since actively monitoring the pod execution of a job is not in the scope of the core Kueue implementation, the number of pods for which the resources are no longer needed should be reported for each PodSet by each framework-specific `GenericJob` implementation.

-Reclaiming the resources of a successful Pod of a Job depends on two parameters: the remaining completions of a Job (**M**) and the Job's parallelism (**N**). Depending on the values of **M** and **N**, we will look at the different cases of how the resources will be reclaimed by Kueue.
+For this purpose, the `GenericJob` interface should be extended with an additional method able to report this:

-Please note that **M** refers to the remaining successful completions of a Job and not the Job's `.spec.completions` field. **M** is subject to change during the lifetime of the Job. One way to derive the value of **M** would be to calculate the difference of the `.spec.completions` and `.status.succeeded` values of a Job.
+```go
+type GenericJob interface {
+	// ...
+
+	// Get reclaimable pods.
+	ReclaimablePods() []ReclaimablePod
+}
+```

-#### Case N >= M
-If a Job's parallelism is greater than or equal to its remaining completions, then for every successful completion of a Pod of the Job, Kueue will reclaim the resources associated with the successful Pod.
+#### Reference design for (`batch/Job`)

-#### Case M > N
-When the remaining completions of a Job are greater than the Job's parallelism, then for every successfully completed Pod of the Job, the resources associated with the Pod won't be reclaimed by Kueue. This is because the resource requirement of the Job still remains the same: the Job has to create a new Pod as a replacement for the successfully completed Pod.
+Having a job defined with parallelism **P** and completions **C**, and **n** completed pod executions, the expected reclaimable pods should be:

-The Job will be able to reclaim the resources of Pods once it satisfies the case `N >= M`. A Job that proceeds in its execution under the case `M > N` will eventually convert to the case `N >= M`, because the value of **M** decreases with successful Pod completions. Hence, the process to reclaim the resources of a Pod of a Job will be the same as described for the case `N >= M`.
+```go
+[]ReclaimablePod{
+	{
+		Name:  "main",
+		Count: P - min(P, C-n),
+	},
+}
+```

-### Pod Failure Cases
-
-#### RestartPolicy of Pods as Never
-
-Whenever a Pod of a Job fails, the Job will create a new Pod as a replacement for the failed Pod. The Job will continue this process of creating new Pods until it reaches its `backoffLimit` (by default, the backoffLimit is `6`). The Pod created to replace a failed Pod will reuse the resources of the failed Pod. Once the Job reaches its `backoffLimit` without reaching the required successful `.spec.completions` count, the Job is termed a `Failed` Job. When the Job is marked as failed, no more Pods will be created by the Job, so the remaining resources owned by the Job will be reclaimed by Kueue.
+##### To consider
+
+According to [kubernetes/enhancements](https://github.com/kubernetes/enhancements), the algorithm presented above might need to be reworked in order to account for:
+- [KEP-3939](https://github.com/kubernetes/enhancements/pull/3940), which adds a new field `terminating` to account for terminating pods. Depending on `spec.RecreatePodsWhen`, the new field needs to be taken into account when the reclaimable pods are computed.
+- [KEP-3850](https://github.com/kubernetes/enhancements/pull/3967), which adds the ability for an index to fail. If an index fails, the resources previously reserved for it are no longer needed.

-#### RestartPolicy of Pods as OnFailure
-
-When the Pods of the Job have `.spec.template.spec.restartPolicy = "OnFailure"`, a failed Pod stays on the node and the failed container within the Pod is restarted. The Pod might run indefinitely, or get marked as `Failed` once the Job exceeds its `.spec.activeDeadlineSeconds`, if specified. In this case as well, Kueue won't reclaim the resources of the Job until the Job completes as either `Failed` or `Succeeded`.
-
-Hence, as seen from the Pod failure cases discussed above, we conclude that the resources of Pods of a Job that fail during execution are not immediately reclaimed by Kueue. Only when the Job is marked as `Failed` will the resources of the failed Pods be reclaimed by Kueue.
-
-## API
+### API

-A new field `ReclaimedPodSets` is added to the `.status` of the Workload API.
+A new field `ReclaimablePods` is added to the `.status` of the Workload API.

```go
-// WorkloadStatus defines the observed state of Workload
+// WorkloadStatus defines the observed state of Workload.
type WorkloadStatus struct {
-	// conditions hold the latest available observations of the Workload
-	// current state.
-	// +optional
-	// +listType=map
-	// +listMapKey=type
-	Conditions []WorkloadCondition `json:"conditions,omitempty"`

-	// list of counts of Pods per PodSet whose resources have been reclaimed
-	// +optional
-	ReclaimedPodSets []ReclaimedPodSet `json:"reclaimedPodSets"`
-}
+	// ...

-// ReclaimedPodSet defines the PodSet name and the count of successfully
-// completed Pods belonging to the PodSet whose resources can be reclaimed.
-type ReclaimedPodSet struct {
-	Name  string
-	Count int
+
+	// reclaimablePods keeps track of the number of pods within a podset for which
+	// the resource reservation is no longer needed.
+	// +optional
+	ReclaimablePods []ReclaimablePod `json:"reclaimablePods,omitempty"`
+}
+
+type ReclaimablePod struct {
+	// name is the PodSet name.
+	Name string `json:"name"`
+
+	// count is the number of pods for which the requested resources are no longer needed.
+	Count int32 `json:"count"`
}
```

## Implementation

-Kueue's Job reconciler will compare the Job's `.status.succeeded` field and the Workload's `.status.reclaimedPodSets[i].count` field value. If the former value is greater than the latter, Kueue's Job reconciler will update the Workload object's `.status`. The Workload reconciler will catch the update event and release the resources of the newly succeeded Pods to the ClusterQueue, depending on which of the cases discussed in the [successful Pod completion](#pod-successful-completion-cases) section the Job satisfies.
+### Workload
+
+#### API
+
+- Add the new field to the workload's status.
+- Validate the data in `status.ReclaimablePods`:
+  1. The names must be found in the `PodSets`.
+  2. The count should never exceed the `PodSets` count.
+  3. The count should not decrease if the workload is admitted.
+
+#### `pkg/workload`
+
+Rework the way `Info.TotalRequests` is computed in order to take the `ReclaimablePods` into account.
+
+### Jobframework
+
+Adapt the `GenericJob` interface, and ensure that the `ReclaimablePods` information provided is synced with its associated workload status.
+
+### Batch/Job
+
+Adapt its `GenericJob` implementation to the new interface.

## Testing Plan

-Dynamically reclaiming resources enhancement has unit and integration tests. These tests are run regularly as a part of Kueue's prow CI/CD pipeline.
+### NonRegression
+
+The new implementation should not impact any of the existing unit, integration or e2e tests. A workload that has no `ReclaimablePods` populated should behave the same as it does prior to this implementation.

### Unit Tests

-All of Kueue's core components must be covered by unit tests. Here is a list of unit tests required for the modules of the feature:
-
-* [Cluster-Queue cache tests](https://github.com/kubernetes-sigs/kueue/blob/main/pkg/cache/cache_test.go)
-* [Workload tests](https://github.com/kubernetes-sigs/kueue/blob/main/pkg/workload/workload_test.go)
+All of Kueue's core components must be covered by unit tests.

### Integration tests
-* Kueue Job Controller
-  - Checking the resources owned by a Job are released to the cache and clusterQueue when a Pod of the Job succeeds.
-  - Integration tests for Job controller are [found here](https://github.com/kubernetes-sigs/kueue/blob/main/test/integration/controller/job/job_controller_test.go).
-
-* Workload Controller
-  - Should update the `.status.reclaimedPodSets` of a Workload when a Pod of a Job succeeds.
-  - Integration tests for Workload Controller are [found here](https://github.com/kubernetes-sigs/kueue/blob/main/test/integration/controller/core/workload_controller_test.go).
-
* Scheduler
-  - Checking if a Workload gets admitted when an active parallel Job releases resources of completed Pods.
-  - Integration tests for Scheduler are [found here](https://github.com/kubernetes-sigs/kueue/blob/main/test/integration/scheduler/scheduler_test.go).
+  - Checking if a Workload gets admitted when an admitted Workload releases a part of its assigned resources.
+
+* Kueue Job Controller (Optional)
+  - Checking the resources owned by a Job are released to the cache and clusterQueue when a Pod of the Job succeeds.

## Implementation History

Dynamically Reclaiming Resources are tracked as part of [enhancement#78](https://github.com/kubernetes-sigs/kueue/issues/78).
diff --git a/keps/78-dynamically-reclaiming-resources/kep.yaml b/keps/78-dynamically-reclaiming-resources/kep.yaml
index ce931ad1e5..606f3998de 100644
--- a/keps/78-dynamically-reclaiming-resources/kep.yaml
+++ b/keps/78-dynamically-reclaiming-resources/kep.yaml
@@ -2,6 +2,7 @@ title: Dynamically Reclaiming Resources
kep-number: 78
authors:
  - "@thisisprasad"
+  - "@trasc"
owning-sig: sig-scheduling
reviewers:
  - "@alculquicondor"