---
title: Even Pods Spreading
authors:
- "@Huang-Wei"
owning-sig: sig-scheduling
reviewers:
- "@wojtek-t"
- "@bsalamat"
- "@k82cn"
approvers:
- "@bsalamat"
- "@k82cn"
creation-date: 2019-02-21
last-updated: 2019-02-21
status: provisional
---

# Even Pods Spreading

## Table of Contents

* [Terms](#terms)
* [Summary](#summary)
* [Motivation](#motivation)
* [Goals](#goals)
* [Non-Goals](#non-goals)
* [Proposal](#proposal)
* [User Stories](#user-stories)
* [Story 1 - NodeAffinity](#story-1---nodeaffinity)
* [Story 2 - PodAffinity](#story-2---podaffinity)
* [Story 3 - PodAntiAffinity](#story-3---podantiaffinity)
* [Risks and Mitigations](#risks-and-mitigations)
* [Design Details](#design-details)
* [Algorithm](#algorithm)
* [NodeAffinity](#nodeaffinity)
* [PodAffinity](#podaffinity)
* [PodAntiAffinity](#podantiaffinity)
* [Test Plan](#test-plan)
* [Graduation Criteria](#graduation-criteria)
* [Implementation History](#implementation-history)

## Terms

- **Topology:** describes a set of worker nodes that belong to the same
  region/zone/rack/hostname/etc. In Kubernetes terms, topologies are defined and
  grouped by node labels.
- **Affinity:** unless otherwise specified, "Affinity" refers to `NodeAffinity`,
  `PodAffinity` and `PodAntiAffinity`.

## Summary

The `EvenPodsSpreading` feature applies to NodeAffinity/PodAffinity/PodAntiAffinity
to give users more fine-grained control over how pods are distributed during
scheduling, so as to achieve better high availability and resource utilization.

## Motivation

In Kubernetes, "Affinity"-related directives aim to control how pods are
scheduled - more packed or more scattered. But right now only limited options
are offered: for `PodAffinity`, an unlimited number of pods can be stacked onto
qualifying topology domain(s); for `PodAntiAffinity`, each topology domain can
hold at most one of the pods.

This is not ideal if users want to spread pods evenly across different topology
domains - for the sake of high availability or saving cost. Regular rolling
upgrades or scaling out replicas can also be problematic. See more details in
the [user stories](#user-stories).

> Even pods spreading is a long-discussed topic, but it only became feasible to
> hammer out the design and implementation details after the recent Affinity
> performance improvements.

### Goals

- Even spreading is achieved among pods via NodeAffinity, PodAffinity and
  PodAntiAffinity, and it only impacts
  `RequiredDuringSchedulingIgnoredDuringExecution` affinity terms.
- Even spreading is a predicate (hard requirement) instead of a priority (soft requirement).
- Even spreading is implemented for a limited set of topologies in the initial version.

### Non-Goals

- Even spreading is NOT calculated on an application basis. In other words, it's
  not applied _only_ within the replicas of an application.
- "Max number of pods per topology" is NOT a goal.
- Scaling down an application is not guaranteed to preserve even pods spreading
  in the initial implementation.

## Proposal

### User Stories

#### Story 1 - NodeAffinity

As an application developer, I want my application pods to be scheduled onto
specific topology domains (via NodeAffinity with a spec like "app in [zone1,
zone2, zone3]"), but I don't want them to be stacked too heavily on any one
topology domain. (see
[#68981](https://github.com/kubernetes/kubernetes/issues/68981))

#### Story 2 - PodAffinity

As an application developer, I want my application pods to co-exist with
particular pods in the same topology domain (via PodAffinity), and I want them
to be deployed onto separate nodes as evenly as possible.

#### Story 3 - PodAntiAffinity

As an application developer, I want my application pods not to co-exist with
specific pods (via PodAntiAffinity). But in some cases it'd be favorable to
tolerate "violating" pods in a manageable way. For example, suppose an app
(replicas=2) using PodAntiAffinity is deployed onto a 2-node cluster, and then
the app needs to perform a rolling upgrade. A third (replacement) pod is
created, but it fails to be placed due to a lack of resources. In this case,

- if HPA is enabled, a new machine will be provisioned to hold the new pod
  (although the old replicas will be deleted afterwards) (see
  [#40358](https://github.com/kubernetes/kubernetes/issues/40358))
- if HPA is not enabled, it's a deadlock since the replacement pod can't be
  placed. The only workaround at the moment is to change the app's strategyType
  from "RollingUpdate" to "Recreate".

Neither is an ideal solution. A promising approach is to give users an option to
trigger a "toleration" mode when the cluster is out of resources. Then, in the
aforementioned example, the third pod is "tolerated" and placed onto node1 (or
node2). But keep in mind that this behavior is only triggered upon resource
shortage: for a 3-node cluster, the third pod will still be placed onto node3
(if node3 is capable).

### Risks and Mitigations

This feature inevitably adds cost to every scheduling cycle. To mitigate the
potential performance impact, the initial implementation will limit the
semantics of "even spreading" to `kubernetes.io/hostname` for PodAffinity and
PodAntiAffinity.

## Design Details

We'd like to propose a new structure called `EvenSpreading`, which is a
sub-field of NodeAffinity, PodAffinity and PodAntiAffinity:

```go
type NodeAffinity struct {
	EvenSpreading *EvenSpreading
	// ... existing fields ...
}

type PodAffinity struct {
	EvenSpreading *EvenSpreading
	// ... existing fields ...
}

type PodAntiAffinity struct {
	EvenSpreading *EvenSpreading
	// ... existing fields ...
}
```

It's only effective when (1) it's not nil, and (2) "hard" affinity requirements
(i.e. `RequiredDuringSchedulingIgnoredDuringExecution`) are defined.

The API of `EvenSpreading` is defined as below:

```go
type EvenSpreading struct {
	// MaxSkew describes the degree of imbalance of pods spreading.
	// Default value is 1 and 0 is not allowed.
	MaxSkew int32
	// TopologyKey defines where pods are placed evenly
	// - for NodeAffinity, it can be a well-known key such as
	//   "failure-domain.beta.kubernetes.io/region" or a self-defined key
	// - for PodAffinity and PodAntiAffinity, it defaults to
	//   "kubernetes.io/hostname" due to performance concerns
	TopologyKey string
}
```
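
For illustration only, here is a hypothetical construction of the field, reusing
the Go types sketched above. The hard node-selector term is elided, but per the
note above it must be present for `EvenSpreading` to take effect:

```go
// Hypothetical usage sketch, assuming the types proposed in this KEP.
// EvenSpreading only takes effect because a hard ("Required...") term is
// also defined on the same NodeAffinity.
nodeAffinity := &NodeAffinity{
	// RequiredDuringSchedulingIgnoredDuringExecution: hard term, e.g. "zone in [zone1, zone2, zone3]"
	EvenSpreading: &EvenSpreading{
		MaxSkew:     1,                                        // tolerate at most 1 pod of imbalance
		TopologyKey: "failure-domain.beta.kubernetes.io/zone", // spread across zones
	},
}
```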

### Algorithm

The pseudo algorithm for each affinity directive is described below.

#### NodeAffinity

```
for each candidate node; do
  if "EvenSpreading" is enabled; then
    count number of matching pods on the whole topology domain
    if matching num - minMatching num < defined MaxSkew; then
      add it to candidate list
    fi
  elif "EvenSpreading" is disabled; then
    keep current logic as is
  fi
done
```

For example, on a 2-zone cluster, suppose an application is specified with
NodeAffinity "app in [zone1, zone2]".

- If "MaxSkew" is 1, rollout of its replicas can be 1/0 => 1/1 => 2/1. It can't
be 2/0 or 3/1.
- If "MaxSkew" is 2, rollout of its replicas can be 1/0 => 1/1 or 2/0.

CAVEAT: the current scheduler doesn't have a data structure to support this yet,
so a performance overhead is expected.
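
The skew check above can be sketched in Go as follows; this is a minimal,
self-contained illustration (the function and variable names are made up here,
not part of the proposal) that replays the two-zone example:

```go
package main

import "fmt"

// feasibleDomains returns the topology domains a new matching pod may land in:
// a domain qualifies if its matching-pod count minus the global minimum count
// is strictly below MaxSkew.
func feasibleDomains(matching map[string]int, maxSkew int) []string {
	min := -1
	for _, n := range matching {
		if min < 0 || n < min {
			min = n
		}
	}
	var out []string
	for domain, n := range matching {
		if n-min < maxSkew {
			out = append(out, domain)
		}
	}
	return out
}

func main() {
	// Current spread is 1/0 across zone1/zone2.
	spread := map[string]int{"zone1": 1, "zone2": 0}
	// MaxSkew=1: only zone2 qualifies, so 2/0 is never produced.
	fmt.Println(feasibleDomains(spread, 1)) // [zone2]
	// MaxSkew=2: both zones qualify, so either 1/1 or 2/0 is allowed.
	fmt.Println(feasibleDomains(spread, 2)) // [zone1 zone2] (map order not guaranteed)
}
```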

#### PodAffinity

```
for each candidate node; do
  if "EvenSpreading" is enabled; then
    count number of matching pods on this node
    if matching num - minMatching num < defined MaxSkew; then
      add it to candidate list
    fi
  elif "EvenSpreading" is disabled; then
    keep current logic as is
  fi
done
done
```

For example, in a 3-node cluster where the matching pods are spread as 2/1/0
(see the snippet after this list),

- if "MaxSkew" is 1, the incoming pod can only be deployed onto node3
- if "MaxSkew" is 2, the incoming pod can be deployed onto node2 or node3

#### PodAntiAffinity

```
# need a structure to know whether there is at least one qualified candidate
existCandidate = whether there is at least one qualified candidate globally
if not existCandidate and "EvenSpreading" is disabled; then
  return
fi
for each candidate node; do
  if existCandidate; then
    add it to candidate list if this node is qualified
  elif "EvenSpreading" is enabled; then
    count number of mis-matching pods on this node
    if misMatching# - minMisMatching# < defined MaxSkew; then
      add it to candidate list
    fi
  fi
done
```
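
A rough, self-contained Go sketch of that fallback (all identifiers here are
made up): the normal PodAntiAffinity check wins whenever any node qualifies, and
the skew check on mis-matching pod counts only kicks in when none does.

```go
package main

import "fmt"

// node holds just what the sketch needs: whether the node already satisfies
// the plain PodAntiAffinity predicate, and the per-node count of pods the
// pseudo code above calls "mis-matching".
type node struct {
	name        string
	qualified   bool
	misMatching int
}

// feasibleNodes mirrors the pseudo code: if any node qualifies outright, the
// existing anti-affinity behavior is kept; otherwise, with EvenSpreading
// enabled, violating pods are tolerated on the nodes with the smallest skew.
func feasibleNodes(nodes []node, evenSpreading bool, maxSkew int) []string {
	existCandidate := false
	minMisMatching := -1
	for _, n := range nodes {
		if n.qualified {
			existCandidate = true
		}
		if minMisMatching < 0 || n.misMatching < minMisMatching {
			minMisMatching = n.misMatching
		}
	}
	if !existCandidate && !evenSpreading {
		return nil
	}
	var out []string
	for _, n := range nodes {
		switch {
		case existCandidate:
			if n.qualified {
				out = append(out, n.name)
			}
		case evenSpreading:
			if n.misMatching-minMisMatching < maxSkew {
				out = append(out, n.name)
			}
		}
	}
	return out
}

func main() {
	// Story 3: a 2-node cluster where both nodes already host a conflicting
	// replica, so neither node qualifies; with EvenSpreading the third pod is
	// tolerated on either node.
	nodes := []node{
		{name: "node1", qualified: false, misMatching: 1},
		{name: "node2", qualified: false, misMatching: 1},
	}
	fmt.Println(feasibleNodes(nodes, true, 1)) // [node1 node2]
}
```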

### Test Plan

_To be filled in once this KEP is targeted at a release._

### Graduation Criteria

_To be filled in once this KEP is targeted at a release._

## Implementation History

- 2019-02-21: Initial KEP sent out for review.
