---
title: Even Pods Spreading
authors:
  - "@Huang-Wei"
owning-sig: sig-scheduling
reviewers:
  - "@wojtek-t"
  - "@bsalamat"
  - "@k82cn"
approvers:
  - "@bsalamat"
  - "@k82cn"
creation-date: 2019-02-21
last-updated: 2019-02-21
status: provisional
---

# Even Pods Spreading

## Table of Contents

* [Terms](#terms)
* [Summary](#summary)
* [Motivation](#motivation)
  * [Goals](#goals)
  * [Non-Goals](#non-goals)
* [Proposal](#proposal)
  * [User Stories](#user-stories)
    * [Story 1 - NodeAffinity](#story-1---nodeaffinity)
    * [Story 2 - PodAffinity](#story-2---podaffinity)
    * [Story 3 - PodAntiAffinity](#story-3---podantiaffinity)
  * [Risks and Mitigations](#risks-and-mitigations)
* [Design Details](#design-details)
  * [Algorithm](#algorithm)
    * [NodeAffinity](#nodeaffinity)
    * [PodAffinity](#podaffinity)
    * [PodAntiAffinity](#podantiaffinity)
  * [Test Plan](#test-plan)
  * [Graduation Criteria](#graduation-criteria)
* [Implementation History](#implementation-history)

## Terms

- **Topology:** describes a set of worker nodes that belong to the same
  region/zone/rack/hostname/etc. In Kubernetes, they are defined and grouped by
  node labels.
- **Affinity**: unless otherwise specified, "Affinity" refers to
  `NodeAffinity`, `PodAffinity` and `PodAntiAffinity`.

## Summary

The `EvenPodsSpreading` feature applies to NodeAffinity/PodAffinity/PodAntiAffinity
to give users more fine-grained control over how pods are distributed across
topology domains during scheduling, so as to achieve better high availability
and resource utilization.

## Motivation

In Kubernetes, "Affinity"-related directives are meant to control how pods are
scheduled - more packed or more scattered. But right now only limited options
are offered: for `PodAffinity`, an unlimited number of pods can be stacked onto
qualifying topology domain(s); for `PodAntiAffinity`, only one matching pod can
be scheduled onto each topology domain.

This is not ideal if users want to place pods evenly across different topology
domains - for the sake of high availability or saving cost. Regular rolling
upgrades or scaling out replicas can also be problematic. See more details in
[user stories](#user-stories).

> Even pods spreading is a long-discussed topic, but it only became feasible to
> hammer out the design/implementation details after the recent Affinity
> performance improvements.

### Goals

- Even spreading is achieved among pods, in the manner of NodeAffinity,
  PodAffinity and PodAntiAffinity, and only impacts
  `RequiredDuringSchedulingIgnoredDuringExecution` affinity terms.
- Even spreading is a predicate (hard requirement) instead of a priority (soft
  requirement).
- Even spreading is implemented on limited topologies in the initial version.

### Non-Goals

- Even spreading is NOT calculated on an application basis. In other words, it
  is not applied _only_ within the replicas of an application.
- "Max number of pods per topology" is NOT a goal.
- Scale-down of an application is not guaranteed to preserve even pods
  spreading in the initial implementation.

## Proposal

### User Stories

#### Story 1 - NodeAffinity

As an application developer, I want my application pods to be scheduled onto
specific topology domains (via NodeAffinity with spec "app in [zone1, zone2,
zone3]"), but I don't want them to be stacked too heavily on any one topology
domain. (see [#68981](https://github.com/kubernetes/kubernetes/issues/68981))

#### Story 2 - PodAffinity

As an application developer, I want my application pods to co-exist with
particular pods in the same topology domain (via PodAffinity), and I want them
to be spread across separate nodes as evenly as possible.

#### Story 3 - PodAntiAffinity

As an application developer, I want my application pods not to co-exist with
specific pods (via PodAntiAffinity). But in some cases it would be favorable to
tolerate "violating" pods in a manageable way. For example, suppose an app
(replicas=2) uses PodAntiAffinity and is deployed onto a 2-node cluster. When
the app performs a rolling upgrade, a third (replacement) pod is created, but
it fails to be placed due to lack of resources. In this case,

- if HPA is enabled, a new machine will be provisioned to hold the new pod
  (although the old replicas will be deleted afterwards) (see
  [#40358](https://github.com/kubernetes/kubernetes/issues/40358))
- if HPA is not enabled, it's a deadlock since the replacement pod can't be
  placed. The only workaround at this moment is to change the app's
  strategyType from "RollingUpdate" to "Recreate".

Neither is an ideal solution. A promising alternative is to give the user an
option to trigger a "toleration" mode when the cluster is out of resources. In
the aforementioned example, the third pod would then be "tolerated" onto node1
(or node2). Keep in mind that this behavior is only triggered upon resource
shortage: in a 3-node cluster, the third pod would still be placed onto node3
(if node3 has capacity).

### Risks and Mitigations

This feature inevitably adds cost to every scheduling cycle. To mitigate the
potential performance impact, the initial implementation will limit the
semantics of "even spreading" to `kubernetes.io/hostname` for PodAffinity and
PodAntiAffinity.

## Design Details

We'd like to propose a new structure called `EvenSpreading`, which is a
sub-field of NodeAffinity, PodAffinity and PodAntiAffinity:

```go
type NodeAffinity struct {
	EvenSpreading *EvenSpreading
	......
}

type PodAffinity struct {
	EvenSpreading *EvenSpreading
	......
}

type PodAntiAffinity struct {
	EvenSpreading *EvenSpreading
	......
}
```

It is only effective when (1) it's not nil and (2) "hard" affinity requirements
(i.e. `RequiredDuringSchedulingIgnoredDuringExecution`) are defined.

The API of `EvenSpreading` is defined as below:

```go
type EvenSpreading struct {
	// MaxSkew describes the degree of imbalance of pods spreading.
	// Default value is 1 and 0 is not allowed.
	MaxSkew int32
	// TopologyKey defines where pods are placed evenly
	// - for NodeAffinity, it can be a well-known key such as
	//   "failure-domain.beta.kubernetes.io/region" or a self-defined key
	// - for PodAffinity and PodAntiAffinity, it defaults to
	//   "kubernetes.io/hostname" due to performance concerns
	TopologyKey string
}
```

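To make the shape of the API concrete, here is a minimal, self-contained Go
sketch of how `EvenSpreading` might be attached to a "hard" PodAntiAffinity
term. The struct definitions below are local, simplified mirrors of the
proposed types rather than the real `k8s.io/api/core/v1` package, so the exact
field set shown is an assumption for illustration only:

```go
package main

import "fmt"

// Simplified mirror of the proposed EvenSpreading type.
type EvenSpreading struct {
	MaxSkew     int32
	TopologyKey string
}

// Simplified mirror of a "hard" anti-affinity term (label selectors omitted).
type PodAffinityTerm struct {
	TopologyKey string
}

// Simplified mirror of PodAntiAffinity with the proposed sub-field.
type PodAntiAffinity struct {
	EvenSpreading                                  *EvenSpreading
	RequiredDuringSchedulingIgnoredDuringExecution []PodAffinityTerm
}

func main() {
	// Pods must not co-locate with matching pods on the same host, and the
	// replicas should stay within a per-host skew of 1.
	anti := PodAntiAffinity{
		RequiredDuringSchedulingIgnoredDuringExecution: []PodAffinityTerm{
			{TopologyKey: "kubernetes.io/hostname"},
		},
		EvenSpreading: &EvenSpreading{
			MaxSkew:     1,
			TopologyKey: "kubernetes.io/hostname",
		},
	}
	fmt.Printf("%+v\n", anti)
}
```
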
### Algorithm

The pseudo algorithms to support each affinity directive are described below.

#### NodeAffinity

```
for each candidate node; do
    if "EvenSpreading" is enabled
        count number of matching pods on the whole topology domain
        if matching num - minMatching num < defined MaxSkew; then
            add it to candidate list
        fi
    elif "EvenSpreading" is disabled
        keep current logic as is
    fi
done
```

For example, in a 2-zone cluster, suppose an application is specified with
NodeAffinity "app in [zone1, zone2]".

- If "MaxSkew" is 1, rollout of its replicas can proceed as 1/0 => 1/1 => 2/1.
  It can't be 2/0 or 3/1.
- If "MaxSkew" is 2, rollout of its replicas can proceed as 1/0 => 1/1 or 2/0.

CAVEAT: the current scheduler doesn't have a data structure to support this
yet, so some performance overhead is expected.

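To make the admissibility test concrete, here is a minimal, self-contained Go
sketch of the per-topology skew check described above. The `fitsSkew` helper
and the way the counts are gathered are illustrative assumptions, not part of
any existing scheduler package; the check itself follows the pseudo algorithm:
a domain qualifies only if its matching count minus the global minimum is
strictly less than `MaxSkew`.

```go
package main

import "fmt"

// fitsSkew reports whether placing one more matching pod into `domain` keeps
// the spread within maxSkew, per the pseudo algorithm above:
// matching num - minMatching num < MaxSkew.
func fitsSkew(matchingPerDomain map[string]int, domain string, maxSkew int) bool {
	minCount := -1
	for _, n := range matchingPerDomain {
		if minCount == -1 || n < minCount {
			minCount = n
		}
	}
	return matchingPerDomain[domain]-minCount < maxSkew
}

func main() {
	// The rollout example from the text: zone1/zone2 currently hold 1/0 pods.
	counts := map[string]int{"zone1": 1, "zone2": 0}

	// MaxSkew=1: only zone2 qualifies, so 1/0 can only become 1/1.
	fmt.Println(fitsSkew(counts, "zone1", 1), fitsSkew(counts, "zone2", 1)) // false true

	// MaxSkew=2: both zones qualify, so 1/0 can become 2/0 or 1/1.
	fmt.Println(fitsSkew(counts, "zone1", 2), fitsSkew(counts, "zone2", 2)) // true true
}
```
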
#### PodAffinity

```
for each candidate node; do
    if "EvenSpreading" is enabled
        count number of matching pods on this node
        if matching num - minMatching num < defined MaxSkew; then
            add it to candidate list
        fi
    elif "EvenSpreading" is disabled
        keep current logic as is
    fi
done
```

For example, in a 3-node cluster where the matching pods spread as 2/1/0:

- if "MaxSkew" is 1, an incoming pod can only be deployed onto node3
- if "MaxSkew" is 2, an incoming pod can be deployed onto node2 or node3

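The check has the same shape as the NodeAffinity sketch above, except that the
counts are kept per node rather than per topology domain. Under the same
illustrative assumptions (the `fitsSkew` helper is hypothetical), the 2/1/0
example plays out as follows:

```go
package main

import "fmt"

// Same admissibility test as in the NodeAffinity sketch, but applied to
// per-node counts of matching pods.
func fitsSkew(matchingPerNode map[string]int, node string, maxSkew int) bool {
	minCount := -1
	for _, n := range matchingPerNode {
		if minCount == -1 || n < minCount {
			minCount = n
		}
	}
	return matchingPerNode[node]-minCount < maxSkew
}

func main() {
	// Matching pods spread as 2/1/0 across node1/node2/node3.
	counts := map[string]int{"node1": 2, "node2": 1, "node3": 0}
	for _, maxSkew := range []int{1, 2} {
		for _, node := range []string{"node1", "node2", "node3"} {
			fmt.Printf("maxSkew=%d %s admissible=%v\n",
				maxSkew, node, fitsSkew(counts, node, maxSkew))
		}
	}
	// maxSkew=1: only node3 is admissible; maxSkew=2: node2 and node3 are.
}
```
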
#### PodAntiAffinity

```
# need a structure to know whether there is at least one qualified candidate
existCandidate = whether there is at least one qualified candidate globally
if not existCandidate and "EvenSpreading" is disabled; then
    return
fi
for each candidate node; do
    if existCandidate; then
        add it to candidate list if this node is qualified
    elif "EvenSpreading" is enabled; then
        count number of mismatching pods on this node
        if misMatching# - minMisMatching# < defined MaxSkew; then
            add it to candidate list
        fi
    fi
done
```

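Below is a minimal, self-contained Go sketch of this fallback ("toleration")
behavior; the `node` struct and `feasibleNodes` helper are illustrative
assumptions, not existing scheduler code. Strictly qualified nodes are always
preferred; only when none exist and `EvenSpreading` is enabled are nodes
admitted based on the mismatching-pod skew.

```go
package main

import "fmt"

type node struct {
	name        string
	qualified   bool // satisfies the PodAntiAffinity term as today
	mismatching int  // number of "violating" pods already on this node
}

// feasibleNodes mirrors the pseudo algorithm: prefer strictly qualified nodes;
// if none exist and EvenSpreading is enabled, tolerate nodes whose mismatching
// count stays within maxSkew of the global minimum.
func feasibleNodes(nodes []node, evenSpreading bool, maxSkew int) []string {
	existCandidate := false
	minMismatching := -1
	for _, n := range nodes {
		if n.qualified {
			existCandidate = true
		}
		if minMismatching == -1 || n.mismatching < minMismatching {
			minMismatching = n.mismatching
		}
	}

	var out []string
	for _, n := range nodes {
		switch {
		case existCandidate:
			if n.qualified {
				out = append(out, n.name)
			}
		case evenSpreading:
			if n.mismatching-minMismatching < maxSkew {
				out = append(out, n.name)
			}
		}
	}
	return out
}

func main() {
	// Story 3: a 2-node cluster where each node already runs one replica, so
	// no node is strictly qualified; the replacement pod is tolerated on both.
	nodes := []node{
		{name: "node1", qualified: false, mismatching: 1},
		{name: "node2", qualified: false, mismatching: 1},
	}
	fmt.Println(feasibleNodes(nodes, true, 1)) // [node1 node2]
}
```

Note that the skew-based tolerance only kicks in when no node satisfies the
anti-affinity term, which is what lets the rolling update in Story 3 make
progress without switching the strategyType to "Recreate".
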
### Test Plan

_To be filled in once this KEP is targeted at a release._

### Graduation Criteria

_To be filled in once this KEP is targeted at a release._

## Implementation History

- 2019-02-21: Initial KEP sent out for review.