Commit

Adapt to the current state of Kueue

trasc committed May 8, 2023
1 parent 2a62fcb commit ee74fa8
Showing 2 changed files with 95 additions and 61 deletions.
155 changes: 94 additions & 61 deletions keps/78-dynamically-reclaiming-resources/README.md
- [Goals](#goals)
- [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [Pod Completion Accounting](#pod-completion-accounting)
    - [Reference design for <code>batch/Job</code>](#reference-design-for-batchjob)
  - [API](#api)
- [Implementation](#implementation)
  - [Workload](#workload)
    - [API](#api-1)
    - [<code>pkg/workload</code>](#pkgworkload)
    - [Controller](#controller)
  - [Jobframework](#jobframework)
  - [Batch/Job](#batchjob)
- [Testing Plan](#testing-plan)
  - [NonRegression](#nonregression)
  - [Unit Tests](#unit-tests)
  - [Integration tests](#integration-tests)
- [Implementation History](#implementation-history)

This proposal allows Kueue to reclaim the resources of a successfully completed Pod of a Workload even before the whole Workload completes its execution. Freeing the unused resources of a Workload earlier may allow more pending Workloads to be admitted. This will allow Kueue to utilize the existing resources of the cluster much more efficiently.

In the remainder of this document:
1. *Job* refers to any kind of Job supported by Kueue, including `batch/Job`, `MPIJob`, `RayJob`, etc.
2. *Pods of a Workload* means the same as *Pods of a Job* and vice-versa.


## Motivation

Currently, the quota assigned to a Job is reclaimed by Kueue only when the whole Job finishes.
For Jobs with multiple Pods, the resources are reclaimed only after the last Pod of the Job finishes.
This is not efficient, as a Job's needs may change during execution; for instance, the needs of a `batch/Job` having `parallelism` equal to `completions` decrease with every Pod that finishes its execution.

### Goals

### Non-Goals

- Preempting a Workload or Pods of a Workload to free resources for a pending Workload.
- Partially admitting a Workload.
- Monitoring the Job's Pod execution.

## Proposal

Reclaiming the resources of the succeeded Pods of a running Job as soon as each Pod completes its execution.

We propose to add a new field `.status.ReclaimablePods` to the Workload API. `.status.ReclaimablePods` is a list that holds, for each PodSet, the count of Pods whose resources are no longer needed and can be reclaimed by Kueue.

### Pod Completion Accounting

Since actively monitoring the Pod execution of a Job is not in the scope of the core Kueue implementation, the number of Pods for which the resources are no longer needed should be reported for each PodSet by the framework-specific `GenericJob` implementation.

For this purpose, the `GenericJob` interface should be changed in one of two ways:

1. Add an additional method that reports the reclaimable pods:

```go
type GenericJob interface {
	// ...

	// Get reclaimable pods.
	ReclaimablePods() []ReclaimablePod
}
```

2. Modify the current `Finished` method to also report the reclaimable pods:

```go
type GenericJob interface {
	// ...

	// Check whether the job is finished and get the reclaimable pods.
	Finished() (condition metav1.Condition, reclaimablePods map[string]int)
}
```

#### Reference design for `batch/Job`

For a Job defined with `parallelism` **P** and `completions` **C**, and **n** completed Pod executions, the expected result is:

```go
map[string]int{
	"main": P - min(P, C-n),
}
```
### API

A new field `ReclaimablePods` is added to the `.status` of the Workload API.

```go
// WorkloadStatus defines the observed state of Workload.
type WorkloadStatus struct {
	// ...

	// reclaimablePods keeps track of the number of pods within a podset for which
	// the resource reservation is no longer needed.
	// +optional
	ReclaimablePods []ReclaimablePod `json:"reclaimablePods,omitempty"`
}

type ReclaimablePod struct {
	// name is the PodSet name.
	Name string `json:"name"`

	// count is the number of pods for which the requested resources are no longer needed.
	Count int32 `json:"count"`
}
```
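For illustration only (names and values assumed), a Workload created for a `batch/Job` with `parallelism: 3` and `completions: 5` that has already recorded 4 successful completions would report that 2 of its 3 `main` Pod slots are reclaimable:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: Workload
metadata:
  name: job-sample
status:
  reclaimablePods:
  - name: main
    count: 2  # P - min(P, C-n) = 3 - min(3, 5-4)
```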

## Implementation

### Workload

#### API

- Add the new field to the workload's status.
- Validate the data in `status.ReclaimablePods` (a sketch follows this list):
  1. The names must be found in the workload's `PodSets`.
  2. The count should never exceed the corresponding `PodSet`'s count.
  3. The count should not decrease while the workload is admitted.
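A sketch of these checks in the style of the existing Kubernetes API validation is shown below; the function signature, the `oldCounts` map, and the error messages are assumptions of this sketch.

```go
import (
	"k8s.io/apimachinery/pkg/util/validation/field"

	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

// validateReclaimablePods enforces the three rules above. oldCounts holds
// the previously reported counts and admitted tells whether the workload
// currently holds quota; both would come from the webhook context.
func validateReclaimablePods(wl *kueue.Workload, oldCounts map[string]int32, admitted bool) field.ErrorList {
	var allErrs field.ErrorList
	podSetCounts := map[string]int32{}
	for _, ps := range wl.Spec.PodSets {
		podSetCounts[ps.Name] = ps.Count
	}
	basePath := field.NewPath("status", "reclaimablePods")
	for i, rp := range wl.Status.ReclaimablePods {
		path := basePath.Index(i)
		total, found := podSetCounts[rp.Name]
		switch {
		case !found:
			allErrs = append(allErrs, field.NotFound(path.Child("name"), rp.Name))
		case rp.Count > total:
			allErrs = append(allErrs, field.Invalid(path.Child("count"), rp.Count, "must not exceed the PodSet count"))
		case admitted && rp.Count < oldCounts[rp.Name]:
			allErrs = append(allErrs, field.Invalid(path.Child("count"), rp.Count, "must not decrease while the workload is admitted"))
		}
	}
	return allErrs
}
```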

#### `pkg/workload`

Rework the way `Info.TotalRequests` is computed in order to take `ReclaimablePods` into account.
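A minimal sketch of the adjusted accounting (helper names assumed; the real change would live in the `Info` construction of `pkg/workload`):

```go
import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// effectiveCount returns the number of Pods in a PodSet that still need quota.
func effectiveCount(podSetCount, reclaimable int32) int32 {
	if reclaimable > podSetCount {
		reclaimable = podSetCount // validation should prevent this
	}
	return podSetCount - reclaimable
}

// scaledRequests multiplies one Pod's requests by the effective count.
func scaledRequests(perPod corev1.ResourceList, count int32) corev1.ResourceList {
	total := corev1.ResourceList{}
	for name, q := range perPod {
		total[name] = *resource.NewMilliQuantity(q.MilliValue()*int64(count), q.Format)
	}
	return total
}
```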
#### Controller

Should update the `status.Admission` when `status.ReclaimablePods` changes.

### Jobframework

Adapt the `GenericJob` interface, and ensure that the `ReclaimablePods` information provided is kept in sync with the associated workload's status.
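For instance (a sketch; the client plumbing and the equality check are assumptions, and `GenericJob` is the interface variant with a `ReclaimablePods` method), the generic reconciler could sync along these lines:

```go
import (
	"context"

	apiequality "k8s.io/apimachinery/pkg/api/equality"
	"sigs.k8s.io/controller-runtime/pkg/client"

	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

// syncReclaimablePods pushes the counts reported by the job into the
// associated workload's status, avoiding no-op updates.
func syncReclaimablePods(ctx context.Context, c client.Client, job GenericJob, wl *kueue.Workload) error {
	reported := job.ReclaimablePods()
	if apiequality.Semantic.DeepEqual(reported, wl.Status.ReclaimablePods) {
		return nil
	}
	wl.Status.ReclaimablePods = reported
	return c.Status().Update(ctx, wl)
}
```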

### Batch/Job

Adapt its `GenericJob` implementation to the new interface.

## Testing Plan

### NonRegression
The new implementation should not impact any of the existing unit, integration, or e2e tests. A workload that has no `ReclaimablePods` populated should behave the same as it did prior to this implementation.

### Unit Tests

All of Kueue's core components must be covered by unit tests.

### Integration tests
* Scheduler
  - Checking that a pending Workload gets admitted when an admitted Workload releases a part of its assigned resources.

* Kueue Job Controller (Optional)
  - Checking that the resources owned by a Job are released to the cache and ClusterQueue when a Pod of the Job succeeds.

## Implementation History

Dynamically Reclaiming Resources is tracked as part of [enhancement#78](https://github.com/kubernetes-sigs/kueue/issues/78).
1 change: 1 addition & 0 deletions keps/78-dynamically-reclaiming-resources/kep.yaml

```yaml
title: Dynamically Reclaiming Resources
kep-number: 78
authors:
  - "@thisisprasad"
  - "@trasc"
owning-sig: sig-scheduling
reviewers:
  - "@alculquicondor"
```
