---
title: Even Pods Spreading
authors:
  - "@Huang-Wei"
owning-sig: sig-scheduling
reviewers:
  - "@wojtek-t"
  - "@bsalamat"
  - "@k82cn"
approvers:
  - "@bsalamat"
  - "@k82cn"
creation-date: 2019-02-21
last-updated: 2019-02-21
status: provisional
---

# Even Pods Spreading

## Table of Contents

* [Terms](#terms)
* [Summary](#summary)
* [Motivation](#motivation)
  * [Goals](#goals)
  * [Non-Goals](#non-goals)
* [Proposal](#proposal)
  * [User Stories](#user-stories)
    * [Story 1 - NodeAffinity](#story-1---nodeaffinity)
    * [Story 2 - PodAffinity](#story-2---podaffinity)
    * [Story 3 - PodAntiAffinity](#story-3---podantiaffinity)
  * [Risks and Mitigations](#risks-and-mitigations)
* [Design Details](#design-details)
  * [Algorithm](#algorithm)
    * [NodeAffinity](#nodeaffinity)
    * [PodAffinity](#podaffinity)
    * [PodAntiAffinity](#podantiaffinity)
  * [Test Plan](#test-plan)
  * [Graduation Criteria](#graduation-criteria)
* [Implementation History](#implementation-history)

## Terms

- **Topology:** describes a set of worker nodes that belong to the same
  region/zone/rack/hostname/etc. In Kubernetes, they are defined and grouped by
  node labels.
- **Affinity**: unless otherwise specified, "Affinity" refers to
  `NodeAffinity`, `PodAffinity` and `PodAntiAffinity`.

## Summary

The `EvenPodsSpreading` feature applies to NodeAffinity/PodAffinity/PodAntiAffinity
to give users more fine-grained control over how pods are distributed across
topology domains during scheduling, so as to achieve better high availability
and resource utilization.

## Motivation

In Kubernetes, "Affinity"-related directives are meant to control how pods are
scheduled - more packed or more scattered. But right now only limited options
are offered: for `PodAffinity`, an unlimited number of pods can be stacked onto
qualifying topology domain(s); for `PodAntiAffinity`, only one matching pod can
be scheduled onto each topology domain.

This is not ideal if users want to place pods evenly across different topology
domains - for the sake of high availability or saving cost. Regular rolling
upgrades or scaling out replicas can also be problematic. See more details in
[user stories](#user-stories).

> Even pods spreading is a long-discussed topic, but it only became feasible to
> hammer out the design/implementation details after the recent Affinity
> performance improvements.

### Goals

- Even spreading is achieved among pods, in the manner of NodeAffinity,
  PodAffinity and PodAntiAffinity, and only impacts
  `RequiredDuringSchedulingIgnoredDuringExecution` affinity terms.
- Even spreading is a predicate (hard requirement) instead of a priority (soft
  requirement).
- Even spreading is implemented on limited topologies in the initial version.

### Non-Goals

- Even spreading is NOT calculated on an application basis. In other words, it
  is not applied _only_ within the replicas of an application.
- "Max number of pods per topology" is NOT a goal.
- Scale-down of an application is not guaranteed to preserve even pods
  spreading in the initial implementation.

## Proposal

### User Stories

#### Story 1 - NodeAffinity

As an application developer, I want my application pods to be scheduled onto
specific topology domains (via NodeAffinity with spec "app in [zone1, zone2,
zone3]"), but I don't want them to be stacked too heavily on any one topology
domain. (see [#68981](https://github.com/kubernetes/kubernetes/issues/68981))

#### Story 2 - PodAffinity

As an application developer, I want my application pods to co-exist with
particular pods in the same topology domain (via PodAffinity), and I want them
to be spread across separate nodes as evenly as possible.

#### Story 3 - PodAntiAffinity

As an application developer, I want my application pods not to co-exist with
specific pods (via PodAntiAffinity). But in some cases it would be favorable to
tolerate "violating" pods in a manageable way. For example, suppose an app
(replicas=2) uses PodAntiAffinity and is deployed onto a 2-node cluster. When
the app performs a rolling upgrade, a third (replacement) pod is created, but
it fails to be placed due to lack of resources. In this case,

- if HPA is enabled, a new machine will be provisioned to hold the new pod
  (although the old replicas will be deleted afterwards) (see
  [#40358](https://github.com/kubernetes/kubernetes/issues/40358))
- if HPA is not enabled, it's a deadlock since the replacement pod can't be
  placed. The only workaround at this moment is to change the app's
  strategyType from "RollingUpdate" to "Recreate".

Neither is an ideal solution. A promising alternative is to give the user an
option to trigger a "toleration" mode when the cluster is out of resources. In
the aforementioned example, the third pod would then be "tolerated" onto node1
(or node2). Keep in mind that this behavior is only triggered upon resource
shortage: in a 3-node cluster, the third pod would still be placed onto node3
(if node3 has capacity).

### Risks and Mitigations

This feature inevitably adds cost to every scheduling cycle. To mitigate the
potential performance impact, the initial implementation will limit the
semantics of "even spreading" to `kubernetes.io/hostname` for PodAffinity and
PodAntiAffinity.

## Design Details

We'd like to propose a new structure called `EvenSpreading`, which is a
sub-field of NodeAffinity, PodAffinity and PodAntiAffinity:

```go
type NodeAffinity struct {
	EvenSpreading *EvenSpreading
	......
}

type PodAffinity struct {
	EvenSpreading *EvenSpreading
	......
}

type PodAntiAffinity struct {
	EvenSpreading *EvenSpreading
	......
}
```

It is only effective when (1) it's not nil and (2) "hard" affinity requirements
(i.e. `RequiredDuringSchedulingIgnoredDuringExecution`) are defined.

The API of `EvenSpreading` is defined as below:

```go
type EvenSpreading struct {
	// MaxSkew describes the degree of imbalance of pods spreading.
	// Default value is 1 and 0 is not allowed.
	MaxSkew int32
	// TopologyKey defines where pods are placed evenly
	// - for NodeAffinity, it can be a well-known key such as
	//   "failure-domain.beta.kubernetes.io/region" or a self-defined key
	// - for PodAffinity and PodAntiAffinity, it defaults to
	//   "kubernetes.io/hostname" due to performance concerns
	TopologyKey string
}
```

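To make the shape of the API concrete, here is a minimal, self-contained Go
sketch of how `EvenSpreading` might be attached to a "hard" PodAntiAffinity
term. The struct definitions below are local, simplified mirrors of the
proposed types rather than the real `k8s.io/api/core/v1` package, so the exact
field set shown is an assumption for illustration only:

```go
package main

import "fmt"

// Simplified mirror of the proposed EvenSpreading type.
type EvenSpreading struct {
	MaxSkew     int32
	TopologyKey string
}

// Simplified mirror of a "hard" anti-affinity term (label selectors omitted).
type PodAffinityTerm struct {
	TopologyKey string
}

// Simplified mirror of PodAntiAffinity with the proposed sub-field.
type PodAntiAffinity struct {
	EvenSpreading                                  *EvenSpreading
	RequiredDuringSchedulingIgnoredDuringExecution []PodAffinityTerm
}

func main() {
	// Pods must not co-locate with matching pods on the same host, and the
	// replicas should stay within a per-host skew of 1.
	anti := PodAntiAffinity{
		RequiredDuringSchedulingIgnoredDuringExecution: []PodAffinityTerm{
			{TopologyKey: "kubernetes.io/hostname"},
		},
		EvenSpreading: &EvenSpreading{
			MaxSkew:     1,
			TopologyKey: "kubernetes.io/hostname",
		},
	}
	fmt.Printf("%+v\n", anti)
}
```
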
### Algorithm

The pseudo algorithms to support each affinity directive are described below.

#### NodeAffinity

```
for each candidate node; do
    if "EvenSpreading" is enabled
        count number of matching pods on the whole topology domain
        if matching num - minMatching num < defined MaxSkew; then
            add it to candidate list
        fi
    elif "EvenSpreading" is disabled
        keep current logic as is
    fi
done
```

For example, in a 2-zone cluster, suppose an application is specified with
NodeAffinity "app in [zone1, zone2]".

- If "MaxSkew" is 1, rollout of its replicas can proceed as 1/0 => 1/1 => 2/1.
  It can't be 2/0 or 3/1.
- If "MaxSkew" is 2, rollout of its replicas can proceed as 1/0 => 1/1 or 2/0.

CAVEAT: the current scheduler doesn't have a data structure to support this
yet, so some performance overhead is expected.

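To make the admissibility test concrete, here is a minimal, self-contained Go
sketch of the per-topology skew check described above. The `fitsSkew` helper
and the way the counts are gathered are illustrative assumptions, not part of
any existing scheduler package; the check itself follows the pseudo algorithm:
a domain qualifies only if its matching count minus the global minimum is
strictly less than `MaxSkew`.

```go
package main

import "fmt"

// fitsSkew reports whether placing one more matching pod into `domain` keeps
// the spread within maxSkew, per the pseudo algorithm above:
// matching num - minMatching num < MaxSkew.
func fitsSkew(matchingPerDomain map[string]int, domain string, maxSkew int) bool {
	minCount := -1
	for _, n := range matchingPerDomain {
		if minCount == -1 || n < minCount {
			minCount = n
		}
	}
	return matchingPerDomain[domain]-minCount < maxSkew
}

func main() {
	// The rollout example from the text: zone1/zone2 currently hold 1/0 pods.
	counts := map[string]int{"zone1": 1, "zone2": 0}

	// MaxSkew=1: only zone2 qualifies, so 1/0 can only become 1/1.
	fmt.Println(fitsSkew(counts, "zone1", 1), fitsSkew(counts, "zone2", 1)) // false true

	// MaxSkew=2: both zones qualify, so 1/0 can become 2/0 or 1/1.
	fmt.Println(fitsSkew(counts, "zone1", 2), fitsSkew(counts, "zone2", 2)) // true true
}
```
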
#### PodAffinity

```
for each candidate node; do
    if "EvenSpreading" is enabled
        count number of matching pods on this node
        if matching num - minMatching num < defined MaxSkew; then
            add it to candidate list
        fi
    elif "EvenSpreading" is disabled
        keep current logic as is
    fi
done
```

For example, in a 3-node cluster where the matching pods spread as 2/1/0:

- if "MaxSkew" is 1, an incoming pod can only be deployed onto node3
- if "MaxSkew" is 2, an incoming pod can be deployed onto node2 or node3

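The check has the same shape as the NodeAffinity sketch above, except that the
counts are kept per node rather than per topology domain. Under the same
illustrative assumptions (the `fitsSkew` helper is hypothetical), the 2/1/0
example plays out as follows:

```go
package main

import "fmt"

// Same admissibility test as in the NodeAffinity sketch, but applied to
// per-node counts of matching pods.
func fitsSkew(matchingPerNode map[string]int, node string, maxSkew int) bool {
	minCount := -1
	for _, n := range matchingPerNode {
		if minCount == -1 || n < minCount {
			minCount = n
		}
	}
	return matchingPerNode[node]-minCount < maxSkew
}

func main() {
	// Matching pods spread as 2/1/0 across node1/node2/node3.
	counts := map[string]int{"node1": 2, "node2": 1, "node3": 0}
	for _, maxSkew := range []int{1, 2} {
		for _, node := range []string{"node1", "node2", "node3"} {
			fmt.Printf("maxSkew=%d %s admissible=%v\n",
				maxSkew, node, fitsSkew(counts, node, maxSkew))
		}
	}
	// maxSkew=1: only node3 is admissible; maxSkew=2: node2 and node3 are.
}
```
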
#### PodAntiAffinity

```
# need a structure to know whether there is at least one qualified candidate
existCandidate = whether there is at least one qualified candidate globally
if not existCandidate and "EvenSpreading" is disabled; then
    return
fi
for each candidate node; do
    if existCandidate; then
        add it to candidate list if this node is qualified
    elif "EvenSpreading" is enabled; then
        count number of mismatching pods on this node
        if misMatching# - minMisMatching# < defined MaxSkew; then
            add it to candidate list
        fi
    fi
done
```

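Below is a minimal, self-contained Go sketch of this fallback ("toleration")
behavior; the `node` struct and `feasibleNodes` helper are illustrative
assumptions, not existing scheduler code. Strictly qualified nodes are always
preferred; only when none exist and `EvenSpreading` is enabled are nodes
admitted based on the mismatching-pod skew.

```go
package main

import "fmt"

type node struct {
	name        string
	qualified   bool // satisfies the PodAntiAffinity term as today
	mismatching int  // number of "violating" pods already on this node
}

// feasibleNodes mirrors the pseudo algorithm: prefer strictly qualified nodes;
// if none exist and EvenSpreading is enabled, tolerate nodes whose mismatching
// count stays within maxSkew of the global minimum.
func feasibleNodes(nodes []node, evenSpreading bool, maxSkew int) []string {
	existCandidate := false
	minMismatching := -1
	for _, n := range nodes {
		if n.qualified {
			existCandidate = true
		}
		if minMismatching == -1 || n.mismatching < minMismatching {
			minMismatching = n.mismatching
		}
	}

	var out []string
	for _, n := range nodes {
		switch {
		case existCandidate:
			if n.qualified {
				out = append(out, n.name)
			}
		case evenSpreading:
			if n.mismatching-minMismatching < maxSkew {
				out = append(out, n.name)
			}
		}
	}
	return out
}

func main() {
	// Story 3: a 2-node cluster where each node already runs one replica, so
	// no node is strictly qualified; the replacement pod is tolerated on both.
	nodes := []node{
		{name: "node1", qualified: false, mismatching: 1},
		{name: "node2", qualified: false, mismatching: 1},
	}
	fmt.Println(feasibleNodes(nodes, true, 1)) // [node1 node2]
}
```

Note that the skew-based tolerance only kicks in when no node satisfies the
anti-affinity term, which is what lets the rolling update in Story 3 make
progress without switching the strategyType to "Recreate".
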
### Test Plan

_To be filled in once this KEP is targeted at a release._

### Graduation Criteria

_To be filled in once this KEP is targeted at a release._

## Implementation History

- 2019-02-21: Initial KEP sent out for review.