Commit

add details of Design B
Huang-Wei committed Mar 13, 2019
1 parent abb5d63 commit 4e36308
Showing 1 changed file with 125 additions and 44 deletions.
169 changes: 125 additions & 44 deletions keps/sig-scheduling/20190221-even-pods-spreading.md
@@ -11,7 +11,7 @@ approvers:
- "@bsalamat"
- "@k82cn"
creation-date: 2019-02-21
last-updated: 2019-02-27
last-updated: 2019-03-11
status: provisional
---

@@ -31,10 +31,13 @@ status: provisional
* [Story 3 - PodAntiAffinity](#story-3---podantiaffinity)
* [Risks and Mitigations](#risks-and-mitigations)
* [Design Details](#design-details)
* [Design A](#design-a)
* [Design B](#design-b)
* [Algorithm](#algorithm)
* [NodeAffinity](#nodeaffinity)
* [PodAffinity](#podaffinity)
* [PodAntiAffinity](#podantiaffinity)
* [Pros/Cons](#proscons)
* [Test Plan](#test-plan)
* [Graduation Criteria](#graduation-criteria)
* [Implementation History](#implementation-history)
@@ -46,13 +49,16 @@ status: provisional
grouped by node labels.
- **Affinity**: if not specified particularly, "Affinity" refers to
`NodeAffinity`, `PodAffinity` and `PodAntiAffinity`.
- **CA**: Cluster Autoscaler. [CA](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler) is a tool that automatically adjusts the size of the Kubernetes cluster upon specific conditions.
- **CA**: Cluster Autoscaler.
[CA](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler)
is a tool that automatically adjusts the size of the Kubernetes cluster upon
specific conditions.

## Summary

`EvenPodsSpreading` feature applies on NodeAffinity/PodAffinity/PodAntiAffinity
to gives users more fine-grained control on distribution of pods scheduling, so
as to achieve better high availability and resource utilization.
`EvenPodsSpreading` feature gives users more fine-grained control on
distribution of pods scheduling, so as to achieve better high availability and
resource utilization.

## Motivation

@@ -73,11 +79,12 @@ details in [user stories](#user-stories).
### Goals

- Even spreading is achieved among pods, in the manner of NodeAffinity,
PodAffinity and PodAntiAffinity, and only impact
`RequiredDuringSchedulingIgnoredDuringExecution` affinity terms.
- Even spreading is a predicate (hard requirement) instead of a priority (soft requirement).
- Even spreading is implemented on limited topologies in initial version.
- Even spreading is calculated among pods rather than at the apps API level
  (such as Deployment or ReplicaSet).
- Even spreading can be either a predicate (hard requirement) or a priority
  (soft requirement).
- Even spreading _might_ be implemented on limited topologies in the initial
  version.

### Non-Goals

@@ -102,7 +109,7 @@ zone3]"), but I don't want them to be stacked too much on one topology. (see

As an application developer, I want my application pods to co-exist with
particular pods in the same topology domain (via PodAffinity), and I want them to
be deployed onto separate nodes as even as possible.
be deployed onto separate nodes (or sub-domains) as evenly as possible.

#### Story 3 - PodAntiAffinity

@@ -120,24 +127,48 @@ created, but it failed to be placed due to lack of resource. In this case,
placed. The only workaround at this moment is to update app strategyType from
"RollingUpdate" to "Recreate".

Both are not ideal solutions. A promising solution is to give user an option to
trigger "toleration" mode when the cluster is out of resource. Then in
Neither of them is an ideal solution. A promising solution is to give the user
an option to trigger "toleration" mode when the cluster is out of resources. Then in
the aforementioned example, a third pod is "tolerated" and put onto node1 (or
node2). But keep in mind that this behavior is only triggered upon resource
shortage. For a 3-node cluster, the third pod will still be placed onto node3
(if node3 is capable).

### Risks and Mitigations

Along with this feature, inevitable cost will be applied each time on
scheduling. So to mitigate potential performance impact, initial implementation
will limit the semantics of "even spreading" on `kubernetes.io/hostname` for
PodAffinity and PodAntiAffinity.
Along with this feature, some additional cost is inevitably incurred in each
pod scheduling cycle. To mitigate the potential performance impact, the initial
implementation will _probably_ limit the semantics of "even spreading" to
`kubernetes.io/hostname`.

We also need to make sure that our implementation will not have any performance
penalty for pods that do not use this feature.

## Design Details

We'd like to propose a new structure called `EvenSpreading`, which is a sub
field of NodeAffinity, PodAffinity and PodAntiAffinity:
Basically there are two options for the API design and implementation:

1. Implemented as a "sub feature" inside Affinity, hence the new API applies to
   `pod.spec.affinity`. (referred to as [Design A](#design-a) in this doc)
1. Implemented as a standalone feature, and the new API applies to `pod.spec`.
   (referred to as [Design B](#design-b) in this doc)

### Design A

A new structure called `EvenSpreading` is introduced and it's only effective
when it's not nil.

```go
type EvenSpreading struct {
	// MaxSkew describes the degree of imbalance of pods spreading.
	// Default value is 1 and 0 is not allowed.
	MaxSkew int32
	// TopologyKey defines where pods are placed evenly
	TopologyKey string
}
```

In this design, `EvenSpreading` is a field in Affinity specs:

```go
type NodeAffinity struct {
@@ -156,40 +187,53 @@ type PodAntiAffinity struct {
}
```
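
For illustration only (the surrounding fields are elided and the exact placement
is an assumption of this sketch, not taken from the diff), the embedding could
look roughly like this:

```go
// Sketch: how Design A could embed the new field in an Affinity spec.
// Only the new field is shown; existing fields are elided.
type PodAffinity struct {
	// ... existing required/preferred affinity term fields ...

	// EvenSpreading, when non-nil, constrains how evenly the pods matched by
	// the hard affinity terms may be spread across TopologyKey domains.
	// +optional
	EvenSpreading *EvenSpreading
}
```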

And it's only effective when (1) it's not nil and (2) "hard" affinity
requirements (i.e. `RequiredDuringSchedulingIgnoredDuringExecution`) are
defined)
### Design B

API of `EvenSpreading` is defined as below:
Unlike Design A, `EvenSpreading` acts as a standalone spec and applies to
`pod.spec`. Similarly it's only effective when it's not nil.

```go
type PodSpec struct {
	EvenSpreading *EvenSpreading
	......
}
```

Inside `EvenSpreading`, we need hard affinity terms (similar to
`PodAffinityTerm`) and soft affinity terms (similar to
`WeightedPodAffinityTerm`). These describe which pods are considered as a group
when we perform even distribution.

```go
type EvenSpreading struct {
	// MaxSkew describes the degree of imbalance of pods spreading.
	// Default value is 1 and 0 is not allowed.
	MaxSkew int32
	// TopologyKey defines where pods are placed evenly
	// - for NodeAffinity, it can be a well-known key such as
	//   "failure-domain.beta.kubernetes.io/region" or a self-defined key
	// - for PodAffinity and PodAntiAffinity, it defaults to
	//   "kubernetes.io/hostname" due to performance concerns
	TopologyKey string
	// Similar to the same field in PodAffinity/PodAntiAffinity
	// +optional
	RequiredDuringSchedulingIgnoredDuringExecution []PodAffinityTerm
	// Similar to the same field in PodAffinity/PodAntiAffinity
	// +optional
	PreferredDuringSchedulingIgnoredDuringExecution []WeightedPodAffinityTerm
}
```
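
As a usage illustration only (the concrete values, the `app=web` label, and the
reuse of the existing `PodAffinityTerm` and `metav1.LabelSelector` types are
assumptions of this sketch), a pod spec under Design B might be populated like
this:

```go
// Sketch: populating EvenSpreading on a pod spec under Design B.
func examplePodSpec() PodSpec {
	return PodSpec{
		EvenSpreading: &EvenSpreading{
			// Allow at most a difference of 1 matching pod between any two
			// "kubernetes.io/hostname" domains (i.e. nodes).
			MaxSkew:     1,
			TopologyKey: "kubernetes.io/hostname",
			// The group of pods to spread evenly: pods labeled app=web.
			RequiredDuringSchedulingIgnoredDuringExecution: []PodAffinityTerm{
				{
					LabelSelector: &metav1.LabelSelector{
						MatchLabels: map[string]string{"app": "web"},
					},
				},
			},
		},
		// ... other PodSpec fields elided ...
	}
}
```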

### Algorithm

The pseudo algorithms to support each affinity directive is described as below.
The pseudo algorithms to support each user story are described below.

#### NodeAffinity

```
```bash
for each candidate node; do
if "EvenSpreading" is enabled
count number of matching pods on the whole topology domain
if "EvenSpreading" is enabled; then
count number of matching pods on the topology domain this node belongs to
if matching num - minMatching num < defined MaxSkew; then
add it to candidate list
fi
elif "EvenSpreading" is disabled
else
keep current logic as is
fi
done
@@ -202,45 +246,57 @@ NodeAffinity "app in [zone1, zone2]".
be 2/0 or 3/1.
- If "MaxSkew" is 2, rollout of its replicas can be 1/0 => 1/1 or 2/0.

CAVEAT: current scheduler doesn't have a data structure to support this yet. A
performance overhead is expected.
This algorithm works for both Design A<sup>1</sup> and B.

<sup>1</sup> As NodeAffinityTerm doesn't have info like
podSelector/namespaces/topologyKey, to make Design A work, that info needs to
be inferred from the NodeAffinity spec.
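
To make the check concrete, below is a rough, self-contained Go sketch of the
skew test; the function name, the map-based bookkeeping, and the demo values are
illustrative and not the scheduler's actual implementation:

```go
package main

import (
	"fmt"
	"math"
)

// fitsEvenSpreading reports whether placing the incoming pod on a node whose
// topology value is domain keeps the spread within maxSkew.
// matchingCount maps every eligible topology value (e.g. a zone name) to the
// number of already-placed pods that match the spreading constraint; domains
// with no matching pods should be present with a count of 0.
func fitsEvenSpreading(matchingCount map[string]int, domain string, maxSkew int32) bool {
	minCount := math.MaxInt32
	for _, c := range matchingCount {
		if c < minCount {
			minCount = c
		}
	}
	if minCount == math.MaxInt32 { // no eligible domains recorded yet
		minCount = 0
	}
	// Mirrors the pseudo algorithm: matching num - minMatching num < MaxSkew.
	return matchingCount[domain]-minCount < int(maxSkew)
}

func main() {
	// Rollout example for NodeAffinity "app in [zone1, zone2]" with MaxSkew=1:
	// with the spread currently at 1/0, only zone2 may take the next replica,
	// so the spread can never become 2/0.
	counts := map[string]int{"zone1": 1, "zone2": 0}
	for _, zone := range []string{"zone1", "zone2"} {
		fmt.Printf("%s fits (MaxSkew=1): %v\n", zone, fitsEvenSpreading(counts, zone, 1))
	}
}
```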

#### PodAffinity

```
```bash
for each candidate node; do
if "EvenSpreading" is enabled
count number of matching pods on this node
if "EvenSpreading" is enabled; then
count number of matching pods on the topology domain this node belongs to
if matching num - minMatching num < defined MaxSkew; then
add it to candidate list
fi
elif "EvenSpreading" is disabled
elif "EvenSpreading" is disabled; then
keep current logic as is
fi
done
```

For example, in a 3-nodes cluster, the matching pods spread as 2/1/0,
For example, in a 3-zone cluster, suppose the matching pods spread as 2/1/0; then

- if "MaxSkew" is 1, incoming pod can only be deployed onto node3
- if "MaxSkew" is 2, incoming pod can be deployed onto node2 or node3
- if "MaxSkew" is 1, incoming pod can only be deployed onto zone3 - i.e. 2/1/1
- if "MaxSkew" is 2, incoming pod can be deployed onto zone2 or zone3 - i.e.
2/2/0 or 2/1/1

This algorithm works for both Design A and B.
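
Reusing the hypothetical `fitsEvenSpreading` helper sketched in the NodeAffinity
section (a snippet only, assuming that helper and `fmt` are in scope), the 2/1/0
example above works out as follows:

```go
// Applying the hypothetical fitsEvenSpreading helper to the 2/1/0 example:
// only domains within MaxSkew of the least-loaded domain stay candidates.
func demoPodAffinitySpread() {
	counts := map[string]int{"zone1": 2, "zone2": 1, "zone3": 0}
	for _, zone := range []string{"zone1", "zone2", "zone3"} {
		fmt.Printf("%s: MaxSkew=1 fits=%v, MaxSkew=2 fits=%v\n",
			zone, fitsEvenSpreading(counts, zone, 1), fitsEvenSpreading(counts, zone, 2))
	}
	// With MaxSkew=1 only zone3 fits; with MaxSkew=2 both zone2 and zone3 fit.
}
```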

#### PodAntiAffinity

```
Design A provides a way to trigger "toleration" mode to broaden the semantics
of PodAntiAffinity, so as to provide a solution for Story 3. The algorithm is
roughly described below:

```bash
# need to have a structure to know if there is at least one qualified candidate
existCandidate = if there is one qualified candidate globally
existCandidate = true if there is one qualified candidate globally else false
if not existCandidate and "EvenSpreading" is disabled; then
return
fi
for each candidate node; do
if existCandidate; then
add it to candidate list if this node is qualified
elif "EvenSpreading" is enabled; then
count number of miss-matching pods on this node
count number of miss-matching pods on the topology domain this node belongs to
if misMatching# - minMisMatching# < defined MaxSkew; then
add it to candidate list
fi
else
keep current logic as is
fi
done
```
@@ -249,6 +305,31 @@ done
> additional pods to co-locate in the same topology, hence the symmetry of
> PodAntiAffinity is not guaranteed as well.

For Design B, by contrast, tweaking the semantics of PodAntiAffinity is not
applicable, because the feature works independently as a predicate/priority.
Hence Design B doesn't work for Story 3.
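
To make the flow above concrete, here is a rough Go sketch of Design A's
toleration fallback; the `Node` type, the `qualified` callback standing in for
today's anti-affinity predicate, and the reuse of the hypothetical
`fitsEvenSpreading` helper are all assumptions of this sketch:

```go
// Node is a minimal stand-in for a candidate node in this sketch.
type Node struct {
	Name          string
	TopologyValue string // value of EvenSpreading.TopologyKey on this node
}

// filterAntiAffinity sketches the pseudo algorithm above: if at least one node
// satisfies PodAntiAffinity, behave as today; otherwise, when EvenSpreading is
// enabled, "tolerate" the violation but spread the conflicting pods evenly.
func filterAntiAffinity(nodes []Node, qualified func(Node) bool,
	evenSpreadingEnabled bool, maxSkew int32, misMatchCount map[string]int) []Node {

	existCandidate := false
	for _, n := range nodes {
		if qualified(n) {
			existCandidate = true
			break
		}
	}
	if !existCandidate && !evenSpreadingEnabled {
		return nil // unchanged behavior: the pod stays pending
	}

	var out []Node
	for _, n := range nodes {
		if existCandidate {
			if qualified(n) {
				out = append(out, n)
			}
		} else if fitsEvenSpreading(misMatchCount, n.TopologyValue, maxSkew) {
			// Toleration mode: domains with the fewest conflicting
			// ("mis-matching") pods, within MaxSkew, are still allowed.
			out = append(out, n)
		}
	}
	return out
}
```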

### Pros/Cons

**Pros of Design A:**

- Fewer API changes
- Fewer code changes (code can be built on the existing InterPodPredicate, as
  well as the internal data structures)

**Cons of Design A:**

- The support for NodeAffinity is vague
- The current API only supports a predicate

**Pros of Design B:**

- Independent design, so it can work independently of the Affinity API
- Supports both a predicate and a priority

**Cons of Design B:**

- Doesn't work for Story 3
- More API changes
- More code changes, plus some refactoring effort to ensure that
  Affinity-related structures/logic can be reused gracefully

### Test Plan

_To be filled until targeted at a release._
