Commit

add details of Design B
Huang-Wei committed Mar 13, 2019
1 parent abb5d63 commit 4e36308
Showing 1 changed file with 125 additions and 44 deletions.
169 changes: 125 additions & 44 deletions keps/sig-scheduling/20190221-even-pods-spreading.md
@@ -11,7 +11,7 @@ approvers:
- "@bsalamat"
- "@k82cn"
creation-date: 2019-02-21
last-updated: 2019-02-27
last-updated: 2019-03-11
status: provisional
---

@@ -31,10 +31,13 @@ status: provisional
* [Story 3 - PodAntiAffinity](#story-3---podantiaffinity)
* [Risks and Mitigations](#risks-and-mitigations)
* [Design Details](#design-details)
* [Design A](#design-a)
* [Design B](#design-b)
* [Algorithm](#algorithm)
* [NodeAffinity](#nodeaffinity)
* [PodAffinity](#podaffinity)
* [PodAntiAffinity](#podantiaffinity)
* [Pros/Cons](#proscons)
* [Test Plan](#test-plan)
* [Graduation Criteria](#graduation-criteria)
* [Implementation History](#implementation-history)
@@ -46,13 +49,16 @@ status: provisional
grouped by node labels.
- **Affinity**: if not specified particularly, "Affinity" refers to
`NodeAffinity`, `PodAffinity` and `PodAntiAffinity`.
- **CA**: Cluster Autoscaler. [CA](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler) is a tool that automatically adjusts the size of the Kubernetes cluster upon specific conditions.
- **CA**: Cluster Autoscaler.
[CA](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler)
is a tool that automatically adjusts the size of the Kubernetes cluster upon
specific conditions.

## Summary

`EvenPodsSpreading` feature applies on NodeAffinity/PodAffinity/PodAntiAffinity
to gives users more fine-grained control on distribution of pods scheduling, so
as to achieve better high availability and resource utilization.
`EvenPodsSpreading` feature gives users more fine-grained control on
distribution of pods scheduling, so as to achieve better high availability and
resource utilization.

## Motivation

@@ -73,11 +79,12 @@ details in [user stories](#user-stories).
### Goals

- Even spreading is achieved among pods, in the manner of NodeAffinity,
PodAffinity and PodAntiAffinity, and only impact
`RequiredDuringSchedulingIgnoredDuringExecution` affinity terms.
- Even spreading is a predicate (hard requirement) instead of a priority (soft requirement).
- Even spreading is implemented on limited topologies in initial version.
- Even spreading is calculated among pods rather than at the apps API level
  (such as Deployment or ReplicaSet).
- Even spreading can be either a predicate (hard requirement) or a priority
  (soft requirement).
- Even spreading _might_ be implemented on limited topologies in the initial
  version.

### Non-Goals

@@ -102,7 +109,7 @@ zone3]"), but I don't want them to be stacked too much on one topology. (see

As an application developer, I want my application pods to co-exist with
particular pods in the same topology domain (via PodAffinity), and I want them to
be deployed onto separate nodes as even as possible.
be deployed onto separate nodes (or sub-domains) as evenly as possible.

#### Story 3 - PodAntiAffinity

@@ -120,24 +127,48 @@ created, but it failed to be placed due to lack of resource. In this case,
placed. The only workaround at this moment is to update app strategyType from
"RollingUpdate" to "Recreate".

Both are not ideal solutions. A promising solution is to give user an option to
trigger "toleration" mode when the cluster is out of resource. Then in
Neither of them is an ideal solution. A promising solution is to give the user
an option to trigger "toleration" mode when the cluster is out of resources. Then in
the aforementioned example, a third pod is "tolerated" and put onto node1 (or
node2). But keep in mind that this behavior is only triggered upon resource
shortage. For a 3-node cluster, the third pod will still be placed onto node3
(if node3 is capable).

### Risks and Mitigations

Along with this feature, inevitable cost will be applied each time on
scheduling. So to mitigate potential performance impact, initial implementation
will limit the semantics of "even spreading" on `kubernetes.io/hostname` for
PodAffinity and PodAntiAffinity.
Along with this feature, some additional cost is inevitably incurred in each
pod scheduling cycle. To mitigate the potential performance impact, the initial
implementation will _probably_ limit the semantics of "even spreading" to
`kubernetes.io/hostname`.

We also need to make sure that our implementation will not have any performance
penalty for pods that do not use this feature.

## Design Details

We'd like to propose a new structure called `EvenSpreading`, which is a sub
field of NodeAffinity, PodAffinity and PodAntiAffinity:
Basically there are two options for the API design and implementation:

1. Implemented as a "sub feature" inside Affinity, hence the new API applies to
   `pod.spec.affinity`. (referred to as [Design A](#design-a) in this doc)
1. Implemented as a standalone feature, and the new API applies to `pod.spec`.
   (referred to as [Design B](#design-b) in this doc)

### Design A

A new structure called `EvenSpreading` is introduced and it's only effective
when it's not nil.

```go
type EvenSpreading struct {
	// MaxSkew describes the degree of imbalance of pods spreading.
	// Default value is 1 and 0 is not allowed.
	MaxSkew int32
	// TopologyKey defines where pods are placed evenly
	TopologyKey string
}
```

In this design, `EvenSpreading` is a field in Affinity specs:

```go
type NodeAffinity struct {
@@ -156,40 +187,53 @@ type PodAntiAffinity struct {
}
```
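
For illustration only (the surrounding fields are elided and the exact placement
is an assumption of this sketch, not taken from the diff), the embedding could
look roughly like this:

```go
// Sketch: how Design A could embed the new field in an Affinity spec.
// Only the new field is shown; existing fields are elided.
type PodAffinity struct {
	// ... existing required/preferred affinity term fields ...

	// EvenSpreading, when non-nil, constrains how evenly the pods matched by
	// the hard affinity terms may be spread across TopologyKey domains.
	// +optional
	EvenSpreading *EvenSpreading
}
```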

And it's only effective when (1) it's not nil and (2) "hard" affinity
requirements (i.e. `RequiredDuringSchedulingIgnoredDuringExecution`) are
defined)
### Design B

API of `EvenSpreading` is defined as below:
Unlike Design A, `EvenSpreading` acts as a standalone spec and applies to
`pod.spec`. Similarly it's only effective when it's not nil.

```go
type PodSpec struct {
	EvenSpreading *EvenSpreading
	......
}
```

Inside `EvenSpreading`, we need hard affinity terms (similar to
`PodAffinityTerm`) and soft affinity terms (similar to
`WeightedPodAffinityTerm`). These describe which pods are considered as a group
when we perform even distribution.

```go
type EvenSpreading struct {
	// MaxSkew describes the degree of imbalance of pods spreading.
	// Default value is 1 and 0 is not allowed.
	MaxSkew int32
	// TopologyKey defines where pods are placed evenly
	// - for NodeAffinity, it can be a well-known key such as
	//   "failure-domain.beta.kubernetes.io/region" or a self-defined key
	// - for PodAffinity and PodAntiAffinity, it defaults to
	//   "kubernetes.io/hostname" due to performance concerns
	TopologyKey string
	// Similar to the same field in PodAffinity/PodAntiAffinity
	// +optional
	RequiredDuringSchedulingIgnoredDuringExecution []PodAffinityTerm
	// Similar to the same field in PodAffinity/PodAntiAffinity
	// +optional
	PreferredDuringSchedulingIgnoredDuringExecution []WeightedPodAffinityTerm
}
```
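
As a usage illustration only (the concrete values, the `app=web` label, and the
reuse of the existing `PodAffinityTerm` and `metav1.LabelSelector` types are
assumptions of this sketch), a pod spec under Design B might be populated like
this:

```go
// Sketch: populating EvenSpreading on a pod spec under Design B.
func examplePodSpec() PodSpec {
	return PodSpec{
		EvenSpreading: &EvenSpreading{
			// Allow at most a difference of 1 matching pod between any two
			// "kubernetes.io/hostname" domains (i.e. nodes).
			MaxSkew:     1,
			TopologyKey: "kubernetes.io/hostname",
			// The group of pods to spread evenly: pods labeled app=web.
			RequiredDuringSchedulingIgnoredDuringExecution: []PodAffinityTerm{
				{
					LabelSelector: &metav1.LabelSelector{
						MatchLabels: map[string]string{"app": "web"},
					},
				},
			},
		},
		// ... other PodSpec fields elided ...
	}
}
```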

### Algorithm

The pseudo algorithms to support each affinity directive is described as below.
The pseudo algorithms to support each user story are described below.

#### NodeAffinity

```
```bash
for each candidate node; do
if "EvenSpreading" is enabled
count number of matching pods on the whole topology domain
if "EvenSpreading" is enabled; then
count number of matching pods on the topology domain this node belongs to
if matching num - minMatching num < defined MaxSkew; then
add it to candidate list
fi
elif "EvenSpreading" is disabled
else
keep current logic as is
fi
done
@@ -202,45 +246,57 @@ NodeAffinity "app in [zone1, zone2]".
be 2/0 or 3/1.
- If "MaxSkew" is 2, rollout of its replicas can be 1/0 => 1/1 or 2/0.

CAVEAT: current scheduler doesn't have a data structure to support this yet. A
performance overhead is expected.
This algorithm works for both Design A<sup>1</sup> and B.

<sup>1</sup> As NodeAffinityTerm doesn't have info like
podSelector/namespaces/topologyKey, to make Design A work, that info needs to
be inferred from the NodeAffinity spec.
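
To make the check concrete, below is a rough, self-contained Go sketch of the
skew test; the function name, the map-based bookkeeping, and the demo values are
illustrative and not the scheduler's actual implementation:

```go
package main

import (
	"fmt"
	"math"
)

// fitsEvenSpreading reports whether placing the incoming pod on a node whose
// topology value is domain keeps the spread within maxSkew.
// matchingCount maps every eligible topology value (e.g. a zone name) to the
// number of already-placed pods that match the spreading constraint; domains
// with no matching pods should be present with a count of 0.
func fitsEvenSpreading(matchingCount map[string]int, domain string, maxSkew int32) bool {
	minCount := math.MaxInt32
	for _, c := range matchingCount {
		if c < minCount {
			minCount = c
		}
	}
	if minCount == math.MaxInt32 { // no eligible domains recorded yet
		minCount = 0
	}
	// Mirrors the pseudo algorithm: matching num - minMatching num < MaxSkew.
	return matchingCount[domain]-minCount < int(maxSkew)
}

func main() {
	// Rollout example for NodeAffinity "app in [zone1, zone2]" with MaxSkew=1:
	// with the spread currently at 1/0, only zone2 may take the next replica,
	// so the spread can never become 2/0.
	counts := map[string]int{"zone1": 1, "zone2": 0}
	for _, zone := range []string{"zone1", "zone2"} {
		fmt.Printf("%s fits (MaxSkew=1): %v\n", zone, fitsEvenSpreading(counts, zone, 1))
	}
}
```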

#### PodAffinity

```
```bash
for each candidate node; do
if "EvenSpreading" is enabled
count number of matching pods on this node
if "EvenSpreading" is enabled; then
count number of matching pods on the topology domain this node belongs to
if matching num - minMatching num < defined MaxSkew; then
add it to candidate list
fi
elif "EvenSpreading" is disabled
elif "EvenSpreading" is disabled; then
keep current logic as is
fi
done
```

For example, in a 3-nodes cluster, the matching pods spread as 2/1/0,
For example, in a 3-zone cluster, suppose the matching pods spread as 2/1/0; then

- if "MaxSkew" is 1, incoming pod can only be deployed onto node3
- if "MaxSkew" is 2, incoming pod can be deployed onto node2 or node3
- if "MaxSkew" is 1, incoming pod can only be deployed onto zone3 - i.e. 2/1/1
- if "MaxSkew" is 2, incoming pod can be deployed onto zone2 or zone3 - i.e.
2/2/0 or 2/1/1

This algorithm works for both Design A and B.
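
Reusing the hypothetical `fitsEvenSpreading` helper sketched in the NodeAffinity
section (a snippet only, assuming that helper and `fmt` are in scope), the 2/1/0
example above works out as follows:

```go
// Applying the hypothetical fitsEvenSpreading helper to the 2/1/0 example:
// only domains within MaxSkew of the least-loaded domain stay candidates.
func demoPodAffinitySpread() {
	counts := map[string]int{"zone1": 2, "zone2": 1, "zone3": 0}
	for _, zone := range []string{"zone1", "zone2", "zone3"} {
		fmt.Printf("%s: MaxSkew=1 fits=%v, MaxSkew=2 fits=%v\n",
			zone, fitsEvenSpreading(counts, zone, 1), fitsEvenSpreading(counts, zone, 2))
	}
	// With MaxSkew=1 only zone3 fits; with MaxSkew=2 both zone2 and zone3 fit.
}
```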

#### PodAntiAffinity

```
Design A provides a way to trigger "toleration" mode to broaden the semantics
of PodAntiAffinity, so as to provide a solution for Story 3. The algorithm is
roughly described below:

```bash
# need to have a structure to know if there is at least one qualified candidate
existCandidate = if there is one qualified candidate globally
existCandidate = true if there is one qualified candidate globally else false
if not existCandidate and "EvenSpreading" is disabled; then
return
fi
for each candidate node; do
if existCandidate; then
add it to candidate list if this node is qualified
elif "EvenSpreading" is enabled; then
count number of miss-matching pods on this node
count number of miss-matching pods on the topology domain this node belongs to
if misMatching# - minMisMatching# < defined MaxSkew; then
add it to candidate list
fi
else
keep current logic as is
fi
done
```
@@ -249,6 +305,31 @@ done
> additional pods to co-locate in the same topology, hence the symmetry of
> PodAntiAffinity is not guaranteed as well.

For Design B, by contrast, tweaking the semantics of PodAntiAffinity is not
applicable, because the feature works independently as a predicate/priority.
Hence Design B doesn't work for Story 3.
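
To make the flow above concrete, here is a rough Go sketch of Design A's
toleration fallback; the `Node` type, the `qualified` callback standing in for
today's anti-affinity predicate, and the reuse of the hypothetical
`fitsEvenSpreading` helper are all assumptions of this sketch:

```go
// Node is a minimal stand-in for a candidate node in this sketch.
type Node struct {
	Name          string
	TopologyValue string // value of EvenSpreading.TopologyKey on this node
}

// filterAntiAffinity sketches the pseudo algorithm above: if at least one node
// satisfies PodAntiAffinity, behave as today; otherwise, when EvenSpreading is
// enabled, "tolerate" the violation but spread the conflicting pods evenly.
func filterAntiAffinity(nodes []Node, qualified func(Node) bool,
	evenSpreadingEnabled bool, maxSkew int32, misMatchCount map[string]int) []Node {

	existCandidate := false
	for _, n := range nodes {
		if qualified(n) {
			existCandidate = true
			break
		}
	}
	if !existCandidate && !evenSpreadingEnabled {
		return nil // unchanged behavior: the pod stays pending
	}

	var out []Node
	for _, n := range nodes {
		if existCandidate {
			if qualified(n) {
				out = append(out, n)
			}
		} else if fitsEvenSpreading(misMatchCount, n.TopologyValue, maxSkew) {
			// Toleration mode: domains with the fewest conflicting
			// ("mis-matching") pods, within MaxSkew, are still allowed.
			out = append(out, n)
		}
	}
	return out
}
```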

### Pros/Cons

**Pros of Design A:**

- Fewer API changes
- Fewer code changes (code can be built on the existing InterPodPredicate, as
  well as the internal data structures)

**Cons of Design A:**

- The support for NodeAffinity is vague
- The current API only supports a predicate

**Pros of Design B:**

- Independent design, so it can work independently of the Affinity API
- Supports both a predicate and a priority

**Cons of Design B:**

- Doesn't work for Story 3
- More API changes
- More code changes, plus some refactoring effort to ensure that
  Affinity-related structures/logic can be reused gracefully

### Test Plan

_To be filled until targeted at a release._
