From 2efa2281c339d40f6beb1a448e28bf53421c461b Mon Sep 17 00:00:00 2001 From: Vallery Lancey Date: Sun, 17 Mar 2019 21:30:20 -0700 Subject: [PATCH 1/5] Added KEP for non-preempting pod priorities --- .../20190317-non-preempting-priorityclass | 104 ++++++++++++++++++ 1 file changed, 104 insertions(+) create mode 100644 keps/sig-scheduling/20190317-non-preempting-priorityclass diff --git a/keps/sig-scheduling/20190317-non-preempting-priorityclass b/keps/sig-scheduling/20190317-non-preempting-priorityclass new file mode 100644 index 00000000000..c16dea96d86 --- /dev/null +++ b/keps/sig-scheduling/20190317-non-preempting-priorityclass @@ -0,0 +1,104 @@ +--- +title: Add NonPreempting Option For PriorityClasses +authors: + - "@vllry" +owning-sig: sig-scheduling +participating-sigs: + - sig-scheduling +reviewers: + - "k82cn" + - "wgliang" +approvers: + - "bsalamat" +editor: Vallery Lancey +creation-date: 2019-03-17 +last-updated: 2019-03-17 +status: implementable +see-also: +replaces: +superseded-by: +--- + +# Allow PriorityClasses To Be Non-Preempting + +## Table of Contents + +* [Table of Contents](#table-of-contents) +* [Summary](#summary) +* [Motivation](#motivation) + * [Goals](#goals) + * [Non-Goals](#non-goals) +* [Proposal](#proposal) + * [Risks and Mitigations](#risks-and-mitigations) +* [Graduation Criteria](#graduation-criteria) +* [Implementation History](#implementation-history) + + +## Summary + +[PriorityClasses](https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/) are a beta feature, +which impact the scheduling and eviction of pods. +Pods will be scheduled according to descending priority. +If a pod cannot be scheduled due to insufficient resources, +lower-priority pods will be descheduled to make room. + +This proposal makes the pre-empting (descheduling) behavior optional, +by adding a new field to PriorityClasses. +If a PriorityClass does not have preemption enabled, +the scheduler will not preempt pods in order to schedule a pod of that priority. + +## Motivation + +High-priority, non-preempting workloads are a common data science use case. +Preempting batch workloads is a waste, as the work unit must be repeated. + +### Goals + +Add a boolean to PriorityClasses, +to enable or disable preemption for pods of that PriorityClass. + +### Non-Goals + +## Proposal + +Add a NonPreempting field to PriorityClasses. +This field will default to false, +for backwards compatibility. + +If NonPreempting is false, +the scheduler will preempt lower priority pods to schedule this pod, +as is current behavior. + +If NonPreempting is true, +a pod of that class will not preempt other pods if it cannot be scheduled. + +Update our documentation to reflect this new feature. + +### Risks and Mitigations + +The new feature may malfuction, +or preemption may be accidentally impaired. +New tests (covering both nonpreepting workloads and mixed workloads), +and the existing preempting PriorityClass tests should be used to prove stability. + +## Graduation Criteria + +* Users are reporting that this resolves their workload priority use-cases +(if not, additional enhancements would be tightly linked to this one). +* The feature has been stable and reliable in at least 2 releases. +* Adequate documentation exists for preemption and the optional field. +* Test coverage includes non-preempting use cases. +* Conformance requirements for non-preempting PriorityClasses are agreed upon. + +## Testing Plan +Add unit and e2e tests for nonpreempting PriorityClasses to the existing scheduler tests. + +Ensure existing tests (for preempting PriorityClasses) do not break. + +## Implementation History + +[Original Github issue](https://github.com/kubernetes/kubernetes/issues/67671) + +Pod Priority and Preemption are tracked as part of [enhancement#564](https://github.com/kubernetes/enhancements/issues/564). +The proposal for Pod Priority can be [found here](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/scheduling/pod-priority-api.md) +and Preemption proposal is [here](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/scheduling/pod-preemption.md). From 16d2d6452527aae174d2f184b7cd03d6d64f975f Mon Sep 17 00:00:00 2001 From: Vallery Lancey Date: Mon, 18 Mar 2019 23:09:30 -0700 Subject: [PATCH 2/5] Fact checks, clarifications, cleanup. --- ... 20190317-non-preempting-priorityclass.md} | 30 +++++++++++++------ 1 file changed, 21 insertions(+), 9 deletions(-) rename keps/sig-scheduling/{20190317-non-preempting-priorityclass => 20190317-non-preempting-priorityclass.md} (82%) diff --git a/keps/sig-scheduling/20190317-non-preempting-priorityclass b/keps/sig-scheduling/20190317-non-preempting-priorityclass.md similarity index 82% rename from keps/sig-scheduling/20190317-non-preempting-priorityclass rename to keps/sig-scheduling/20190317-non-preempting-priorityclass.md index c16dea96d86..9b54907bf35 100644 --- a/keps/sig-scheduling/20190317-non-preempting-priorityclass +++ b/keps/sig-scheduling/20190317-non-preempting-priorityclass.md @@ -36,16 +36,16 @@ superseded-by: ## Summary -[PriorityClasses](https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/) are a beta feature, +[PriorityClasses](https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/) are a GA feature as on 1.14, which impact the scheduling and eviction of pods. Pods will be scheduled according to descending priority. If a pod cannot be scheduled due to insufficient resources, -lower-priority pods will be descheduled to make room. +lower-priority pods will be descheduled ("preempted") to make room. -This proposal makes the pre-empting (descheduling) behavior optional, +This proposal makes the preempting behavior optional for a PriorityClass, by adding a new field to PriorityClasses. If a PriorityClass does not have preemption enabled, -the scheduler will not preempt pods in order to schedule a pod of that priority. +a pod of that PriorityClass will not trigger preemption of other pods. ## Motivation @@ -61,16 +61,28 @@ to enable or disable preemption for pods of that PriorityClass. ## Proposal -Add a NonPreempting field to PriorityClasses. -This field will default to false, +Add a Preempting field to PriorityClasses. +This field will default to true, for backwards compatibility. -If NonPreempting is false, +``` +type PriorityClass struct { + metav1.TypeMeta + metav1.ObjectMeta + Value int32 + GlobalDefault bool + Description string + // New option + Preempting bool +} +``` + +If Preempting is true for a pod, the scheduler will preempt lower priority pods to schedule this pod, as is current behavior. -If NonPreempting is true, -a pod of that class will not preempt other pods if it cannot be scheduled. +If Preempting is false, +a pod of that class will not preempt other pods. Update our documentation to reflect this new feature. From d464db791625b0247015843c0908c43d593d14d0 Mon Sep 17 00:00:00 2001 From: Vallery Lancey Date: Thu, 28 Mar 2019 20:50:47 -0700 Subject: [PATCH 3/5] Changed Preempting field to be duplicated between PriorityClass and PodSpec. --- .../20190317-non-preempting-priorityclass.md | 44 ++++++++++++++----- 1 file changed, 33 insertions(+), 11 deletions(-) diff --git a/keps/sig-scheduling/20190317-non-preempting-priorityclass.md b/keps/sig-scheduling/20190317-non-preempting-priorityclass.md index 9b54907bf35..490c4b8f8c1 100644 --- a/keps/sig-scheduling/20190317-non-preempting-priorityclass.md +++ b/keps/sig-scheduling/20190317-non-preempting-priorityclass.md @@ -12,7 +12,7 @@ approvers: - "bsalamat" editor: Vallery Lancey creation-date: 2019-03-17 -last-updated: 2019-03-17 +last-updated: 2019-03-28 status: implementable see-also: replaces: @@ -61,10 +61,21 @@ to enable or disable preemption for pods of that PriorityClass. ## Proposal -Add a Preempting field to PriorityClasses. +Add a Preempting field to both PodSpec and PriorityClass. This field will default to true, for backwards compatibility. +If Preempting is true for a pod, +the scheduler will preempt lower priority pods to schedule this pod, +as is current behavior. + +If Preempting is false, +a pod of that priority will not preempt other pods. + +Setting the Preempting field in PriorityClass provides a straightforward interface, +and allows ResourceQuotas to restrict preemption. + +PriorityClass type example: ``` type PriorityClass struct { metav1.TypeMeta @@ -72,19 +83,29 @@ type PriorityClass struct { Value int32 GlobalDefault bool Description string - // New option - Preempting bool + Preempting *bool // New option } ``` -If Preempting is true for a pod, -the scheduler will preempt lower priority pods to schedule this pod, -as is current behavior. +The Preempting field in PodSpec will be populated during pod admission, +similarly to how the PriorityClass Value is populated. +Storing the Preempting field in the pod spec has several benefits: +* The scheduler does not need to be aware of PiorityClasses, +as all relevant information is in the pod. +* Mutating PriorityClass objects does not impact existing pods. +* Kubelets can set Preempting on static pods. -If Preempting is false, -a pod of that class will not preempt other pods. +PodSpec type example: +``` +type PodSpec struct { + ... + Preempting *bool + ... +} +``` -Update our documentation to reflect this new feature. +Documentation must be updated to reflect the new feature, +and changes to PriorityClass/PodSpec fields. ### Risks and Mitigations @@ -103,7 +124,8 @@ and the existing preempting PriorityClass tests should be used to prove stabilit * Conformance requirements for non-preempting PriorityClasses are agreed upon. ## Testing Plan -Add unit and e2e tests for nonpreempting PriorityClasses to the existing scheduler tests. +Add unit, integration e2e tests for nonpreempting PriorityClasses to the existing scheduler tests. +Integration tests should be a focus. Ensure existing tests (for preempting PriorityClasses) do not break. From e28009930eb9aee1d9c817ad48d2f0cf14240522 Mon Sep 17 00:00:00 2001 From: Vallery Lancey Date: Thu, 28 Mar 2019 21:15:49 -0700 Subject: [PATCH 4/5] Address clarity and motiviation --- .../20190317-non-preempting-priorityclass.md | 35 ++++++++++++++----- 1 file changed, 26 insertions(+), 9 deletions(-) diff --git a/keps/sig-scheduling/20190317-non-preempting-priorityclass.md b/keps/sig-scheduling/20190317-non-preempting-priorityclass.md index 490c4b8f8c1..14f5afa3269 100644 --- a/keps/sig-scheduling/20190317-non-preempting-priorityclass.md +++ b/keps/sig-scheduling/20190317-non-preempting-priorityclass.md @@ -38,19 +38,33 @@ superseded-by: [PriorityClasses](https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/) are a GA feature as on 1.14, which impact the scheduling and eviction of pods. -Pods will be scheduled according to descending priority. +Pods are be scheduled according to descending priority. If a pod cannot be scheduled due to insufficient resources, -lower-priority pods will be descheduled ("preempted") to make room. +lower-priority pods will be preempted to make room. This proposal makes the preempting behavior optional for a PriorityClass, -by adding a new field to PriorityClasses. -If a PriorityClass does not have preemption enabled, -a pod of that PriorityClass will not trigger preemption of other pods. +by adding a new field to PriorityClasses, +which in turn populates PodSpec. +If a pod is waiting to be scheduled, +and it does not have preemption enabled, +it will not trigger preemption of other pods. ## Motivation -High-priority, non-preempting workloads are a common data science use case. -Preempting batch workloads is a waste, as the work unit must be repeated. +Allowing PriorityClasses to be non-preempting is important for running batch workloads. + +Batch workloads typically have a backlog of work, +with unscheduled pods. +Higher-priority workloads can be assigned a higher priority via a PriorityClass, +to ensure they go to the front of the scheduling queue. +However, +preempting batch workloads is undesirable, +as all work done by the preempted pod is typically lost. + +Users could create a non-preempting PriorityClasses, +to ensure their most time-sensitive workloads are scheduled before other queued pods, +without risking discarding the work of running pods. + ### Goals @@ -59,6 +73,8 @@ to enable or disable preemption for pods of that PriorityClass. ### Non-Goals +* Protecting pods from preemption. PodDisruptionBudget should be used. + ## Proposal Add a Preempting field to both PodSpec and PriorityClass. @@ -124,8 +140,9 @@ and the existing preempting PriorityClass tests should be used to prove stabilit * Conformance requirements for non-preempting PriorityClasses are agreed upon. ## Testing Plan -Add unit, integration e2e tests for nonpreempting PriorityClasses to the existing scheduler tests. -Integration tests should be a focus. +Add detailed unit and integration tests for nonpreempting workloads. + +Add basic e2e tests, to ensure all components are working together. Ensure existing tests (for preempting PriorityClasses) do not break. From ca08319bcb081d16088a0d1437dea10ff348fefb Mon Sep 17 00:00:00 2001 From: Vallery Lancey Date: Fri, 29 Mar 2019 17:27:54 -0700 Subject: [PATCH 5/5] Added feature gate, shuffled use case info. --- .../20190317-non-preempting-priorityclass.md | 30 +++++++++++-------- 1 file changed, 18 insertions(+), 12 deletions(-) diff --git a/keps/sig-scheduling/20190317-non-preempting-priorityclass.md b/keps/sig-scheduling/20190317-non-preempting-priorityclass.md index 14f5afa3269..f338e23f2b9 100644 --- a/keps/sig-scheduling/20190317-non-preempting-priorityclass.md +++ b/keps/sig-scheduling/20190317-non-preempting-priorityclass.md @@ -56,15 +56,9 @@ Allowing PriorityClasses to be non-preempting is important for running batch wor Batch workloads typically have a backlog of work, with unscheduled pods. Higher-priority workloads can be assigned a higher priority via a PriorityClass, -to ensure they go to the front of the scheduling queue. -However, -preempting batch workloads is undesirable, -as all work done by the preempted pod is typically lost. - -Users could create a non-preempting PriorityClasses, -to ensure their most time-sensitive workloads are scheduled before other queued pods, -without risking discarding the work of running pods. - +but this may result in pods with partially-completed work being preempted. +Adding the non-preempting option allows users to prioritize the scheduling queue, +without discarding incomplete work. ### Goals @@ -120,20 +114,32 @@ type PodSpec struct { } ``` +This feature should be gated in alpha, provisionally under the gate `NonPreemptingPriority`. + Documentation must be updated to reflect the new feature, and changes to PriorityClass/PodSpec fields. ### Risks and Mitigations The new feature may malfuction, -or preemption may be accidentally impaired. +or existing preemption functionality may be impaired. New tests (covering both nonpreepting workloads and mixed workloads), and the existing preempting PriorityClass tests should be used to prove stability. ## Graduation Criteria -* Users are reporting that this resolves their workload priority use-cases -(if not, additional enhancements would be tightly linked to this one). +**Typical user story:** +A user is running batch workloads on a cluster. +The user has a high-priority job, +that they wish to schedule before other workloads in the queue. +As the user does not want to preempt running batch workloads and discard work, +the user creates the new workload with a high-priority, +non-preempting PriorityClass. +The new workload's pods are scheduled ahead of the queue, +without disrupting running workloads. + +* Users are able to run preempting and non-preempting workloads in a stable manner, +and are not requesting additional changes. * The feature has been stable and reliable in at least 2 releases. * Adequate documentation exists for preemption and the optional field. * Test coverage includes non-preempting use cases.