Skip to content

Commit

Permalink
Merge pull request #1260 from alculquicondor/default-even-pod-spreading
Browse files Browse the repository at this point in the history
Add KEP for default Even Pods Spreading
  • Loading branch information
k8s-ci-robot authored Oct 7, 2019
2 parents d0177e1 + d36b563 commit 872fd49
Showing 1 changed file with 302 additions and 0 deletions.
302 changes: 302 additions & 0 deletions keps/sig-scheduling/20190926-default-even-pod-spreading.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,302 @@
---
title: Default Even Pod Spreading
authors:
- "@alculquicondor"
owning-sig: sig-scheduling
reviewers:
- "@ahg-g"
- "@Huang-Wei"
approvers:
- "@ahg-g"
- "@k82cn"
creation-date: 2019-09-26
last-updated: 2010-09-26
status: provisional
see-also:
- "/keps/sig-aaa/20190221-even-pods-spreading.md"
---

# Default Even Pod Spreading

## Table of Contents

<!-- toc -->
- [Release Signoff Checklist](#release-signoff-checklist)
- [Summary](#summary)
- [Motivation](#motivation)
- [Goals](#goals)
- [Non-Goals](#non-goals)
- [Proposal](#proposal)
- [User Stories](#user-stories)
- [Story 1](#story-1)
- [Story 2](#story-2)
- [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints)
- [Relationship with SelectorSpreadingPriority](#relationship-with-selectorspreadingpriority)
- [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
- [API](#api)
- [Default rules](#default-rules)
- [How user stories are addressed](#how-user-stories-are-addressed)
- [Implementation Details](#implementation-details)
- [In the metadata/predicates/priorities flow](#in-the-metadatapredicatespriorities-flow)
- [In the scheduler framework](#in-the-scheduler-framework)
- [Test Plan](#test-plan)
- [Graduation Criteria](#graduation-criteria)
- [Implementation History](#implementation-history)
- [Alternatives](#alternatives)
<!-- /toc -->

## Release Signoff Checklist

- [ ] kubernetes/enhancements issue in release milestone, which links to KEP (this should be a link to the KEP location in kubernetes/enhancements, not the initial KEP PR)
- [ ] KEP approvers have set the KEP status to `implementable`
- [ ] Design details are appropriately documented
- [ ] Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- [ ] Graduation criteria is in place
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

## Summary

With [Even Pods Spreading](/keps/sig-scheduling/20190221-even-pods-spreading.md),
workload authors can define spreading rules for their loads based on the topology of the clusters.
The spreading rules are defined in the `PodSpec`, thus they are tied to the pod.

We propose the introduction of configurable default spreading constraints, i.e. constraints that
can be defined at the cluster level and are applied to pods that don't explicitly define spreading constraints.
This way, all pods can be spread according to (likely better informed) constraints set by a cluster operator.
Workload authors don't need to know the topology of the cluster they will be running on to have their pods spread.
But if they do, they can still set their own spreading constraints if they have specific needs.

## Motivation

In order for a workload (pod) to use `EvenPodsSpreadPriority`:

1. Authors have to have an idea of the underlying topology.
1. PodSpecs become less portable if their spreading constraints are tailored to a specific topology.

On the other hand, cluster operators know the underlying topology of the cluster, which makes
them suitable to provide default spreading constraints for all workloads in their cluster.

### Goals

- Cluster operators can define default spreading constraints for pods that don't provide any
`pod.spec.topologySpreadConstraints`.
- Workloads are spread with the default constraints if they belong to the same service, replication controller,
replica set or stateful set, and if they don't define `pod.spec.topologySpreadConstraints`.
- Provide a k8s default for `topologySpreadConstraints` that produces a priority equivalent to
`SelectorSpreadPriority`, so that this algorithm can be removed from the default algorithms' provider.

### Non-Goals

- Set defaults for specific namespaces or according to other selectors.
- Removal of `ServiceSpreadingPriority` or `ServiceAntiAffinity` priorities.

## Proposal

### User Stories

#### Story 1

As a cluster operator, I want to set default spreading constraints for workloads in the cluster.
Currently, `SelectorSpreadPriority` provides a canned priority that spreads across nodes
and zones (`failure-domain.beta.kubernetes.io/zone`). However, the nodes in my cluster have
custom topology keys (for physical host, rack, etc.).

#### Story 2

As a workload author, I want to spread the workload in the cluster, but:
(1) I don't know the topology of the cluster I'm running on.
(2) I want to be able to run my PodSpec in different clusters (on-prem and cloud).

### Implementation Details/Notes/Constraints


#### Relationship with SelectorSpreadingPriority

Note that Default `topologySpreadConstraints` has a similar effect to `SelectorSpreadingPriority`.
Given that the latter is not configurable, they could return conflicting priorities, which
may not be the intention of an operator. On the other hand, a proper Default for `topologySpreadConstraints`
could provide the same priority as `SelectorSpreadingPriority`. Thus, there's no need for the
features to co-exist.

K8s will set Default `topologySpreadConstraints` and remove `SelectorSpreadingPriority`
from the k8s providers (`DefaultProvider` and `ClusterAutoscalerProvider`). The set
[default](#default-rules) will have a similar effect.

If an operator sets a Policy, these are the semantics of the presence of `SelectorSpreadingPriority`
and/or `EvenPodsSpreadPriority`:

| SelectorSpreading | EvenPodsSpread | Operator constraints | Valid | Pod spread constraints |
| :---------------: | :------------: | :------------------: | :---: | :---------------------------: |
| N | Y | Not provided | Yes | [k8s default](#default-rules) |
| N | Y | Provided | Yes | provided by operator |
| Y | Y | Not provided | Yes | [k8s default](#default-rules) |
| Y | Y | Provided | No | - |
| N | N | - | Yes | None |
| Y | N | - | No | - |

Selecting `SelectorSpreadingPriority` but not `EvenPodsSpreadPriority` in a policy is an invalid
configuration, because the latter is a requirement for the former.

### Risks and Mitigations

`EvenPodsSpreadPriority` has some overhead and we currently ensure that pods that don't use the
feature get minimally affected. After Default `topologySpreadConstraints` is rolled out,
all pods will run through the algorithms.
However, we should ensure that the running overhead is not significantly higher than
`SelectorSpreadingPriority` with k8s Default `topologySpreadConstraints`.

## Design Details

### API

A new structure called `TopologySpreadConstraint` is introduced to `KubeSchedulerConfiguration`.

```go
type KubeSchedulerConfiguration struct {
....
// TopologySpreadConstraints defines topology spread constraints to be applied to pods
// that don't define any in `pod.spec.topologySpreadConstraints`. Pod selectors must
// be empty, as they are deduced from the resources that the pod belongs to
// (includes services, replication controllers, replica sets and stateful sets).
// If not specified, the scheduler applies the following default constraints:
// <default rules go here. See next section>
// +optional
TopologySpreadConstraints []corev1.TopologySpreadConstraint
....
}
```

Note the use of `k8s.io/api/core/v1.TopologySpreadConstraint`. During validation,
we verify that selectors are not defined.

### Default rules

These will be the default constraints for the cluster when the operator doesn't provide any:

```yaml
defaultTopologySpreadConstraints:
- maxSkew: 1
topologyKey: "kubernetes.io/hostname"
whenUnsatisfiable: ScheduleAnyway
- maxSkew: 1
topologyKey: "failure-domain.beta.kubernetes.io/zone"
whenUnsatisfiable: ScheduleAnyway
```
### How user stories are addressed
Let's say we have a cluster that has a topology based on physical hosts and racks.
Then, an operator can set the following scheduler configuration:
```yaml
apiVersion: componentconfig/v1alpha1
defaultTopologySpreadConstraints:
- maxSkew: 5
topologyKey: "example.com/topology/physical_host"
whenUnsatisfiable: ScheduleAnyway
- maxSkew: 15
topologyKey: "example.com/topology/rack"
whenUnsatisfiable: DoNotSchedule
```
Then, a workload author could have the following `ReplicaSet`:

```yaml
apiVersion: apps/v1
kind: ReplicaSet
metadata:
name: replicated_demo
spec:
replicas: 3
selector:
matchLabels:
app: demo
template:
metadata:
labels:
app: demo
spec:
containers:
- name: php-redis
image: example.com/registry/demo:latest
```

Note that the workload author didn't provide spreading constraints in the `pod.spec`.
The following spreading constraints will be derived from the constraints defined in ComponentConfig,
and will be applied at runtime:

```yaml
topologySpreadConstraints:
- maxSkew: 5
TopologyKey: "example.com/topology/physical_host"
WhenUnsatisfiable: ScheduleAnyway
selector:
matchLabels:
app: demo
- maxSkew: 15
TopologyKey: "example.com/topology/rack"
WhenUnsatisfiable: DoNotSchedule
selector:
matchLabels:
app: demo
```

Please note that these constraints are honored internally in the scheduler, but they are NOT
persisted in the PodSpec via API Server.

### Implementation Details

#### In the metadata/predicates/priorities flow

1. Calculate the spreading constraints for the pod as part of the metadata calculation.
Use the constraints provided by the pod or calculate the default ones if they don't provide any.
1. When running the predicates or priorities, use the constraints stored in the metadata.

#### In the scheduler framework

1. Calculate spreading constraints for the pod in the `PreFilter` extension point. Store them
in the `PluginContext`.
1. In the `Filter` and `Score` extension points, use the stored spreading constraints instead of
the ones defined by the pod.

### Test Plan

To ensure this feature to be rolled out in high quality. Following tests are mandatory:

- **Unit Tests**: All core changes must be covered by unit tests.
- **Integration Tests**: One integration test for the default rules and one for custom rules.
- **Benchmark Tests**: A benchmark test that compare the default rules against `SelectorSpreadingPriority`.
The performance should be as close as possible.

### Graduation Criteria

Alpha (v1.17):

- [ ] Scheduler Component Config API changes.
- [ ] Default, validation and generated code.
- [ ] Priority Implementation.
- [ ] Predicate implementation.
- [ ] Test cases mentioned in the [Test Plan](#test-plan).

## Implementation History

- 2019-09-26: Initial KEP sent out for review.

## Alternatives

- Make the topology keys used in `SelectorSpreadingPriority` configurable.

While this moves the scheduler in the right direction, there are two problems:

1. We can only support one topology key.
1. It makes it hard for pods to override the operator-provided spreading rules.

- Implement a mutating controller that sets defaults.

This approach would likely allow us to provide a more flexible interface that
can set defaults for specific namespaces or with other selectors. However, that
wouldn't allow us to replace `SelectorSpreadingPriority` with
`EvenPodsSpreading`.

0 comments on commit 872fd49

Please sign in to comment.