Add CPU manager proposal. #654

Merged
merged 15 commits into from
Aug 1, 2017
Changes from 5 commits
307 changes: 307 additions & 0 deletions contributors/design-proposals/cpu-manager.md
@@ -0,0 +1,307 @@
# CPU Manager

_Authors:_

* @ConnorDoyle - Connor Doyle
* @flyingcougar - Szymon Scharmach
* @sjenning - Seth Jennings

**Contents:**

* [Overview](#overview)
* [Proposed changes](#proposed-changes)
* [Operations and observability](#operations-and-observability)
* [Practical challenges](#practical-challenges)
* [Implementation roadmap](#implementation-roadmap)
* [Appendix A: cpuset pitfalls](#appendix-a-cpuset-pitfalls)

## Overview

_Problems to solve:_

1. Poor or unpredictable performance compared to virtual machine
based orchestration systems. Applications see higher latency and lower
CPU throughput than on VMs because cpu quota is fulfilled across all
cores rather than on exclusive cores; exclusive cores would result in
fewer context switches and higher cache affinity.
1. Unacceptable latency attributed to the OS process scheduler, especially
for “fast” virtual network functions (want to approach line rate on
modern server NICs.)

_Solution requirements:_
Member

add a requirement that a container must be able to discover the set of cpus it has been assigned.

Contributor

I'm not sure I agree that this should be a requirement. I see the value, where people are wanting to do thread affinity inside the pod, but it comes with baggage.

It pins us in when we go to add the dynamic cpuset manager later.

It also creates a problem in the kubelet restart case. If our contract is "I'll put you on an exclusive core", we have the freedom to "reschedule" on kubelet restart rather than writing a nasty cgroup parser to reconstitute the cpuset manager state.


Reschedule as in possibly shift a container to a new core after it's already been running? That would be rather scary for many workloads\users if that happened.

The cpuset is just a bit flag which is easy to parse?

Could the assigned cores be saved to etcd or a local cache so it's recoverable perhaps?

Contributor Author

@chino the CPU mask would only be changed after process start if the dynamic CPU Manager policy is enabled. The static policy guarantees never to do that.

@derekwaynecarr and @sjenning: Inside the container, workloads that care about their CPU mask can easily get it by reading /proc/self/status (see below). Is that sufficient?

connor:~/ $ grep -i cpus /proc/self/status
Cpus_allowed:   77
Cpus_allowed_list:      0-2,4-6

Member

For a dynamic policy, we should send a signal. For static, I agree it's not needed.

Contributor Author

Backfilling from off-list discussion: we talked today about providing a signal in the following way. We could project (a subset of) the CPU manager state into a volume visible to selected containers. User workloads could subscribe to update events in a normal Linux manner (e.g. inotify.) We also decided, more or less, that this won't be necessary until the dynamic policy is implemented. We can document that user containers in the shared pool must avoid setting their own CPU affinity, since it will be overwritten without notice any time the membership of the shared pool changes.

Member

catching up on doc, so this may be covered elsewhere, but we should say something about signalling in a future iteration of this proposal.

Contributor Author

Will add it to the dynamic policy section.

Contributor Author

added


1. Provide an API-driven contract from the system to a user: "if you are a
Guaranteed pod with 1 or more cores of cpu, the system will try to make
sure that the pod gets its cpu quota primarily from reserved core(s),
resulting in fewer context switches and higher cache affinity".
1. Support the case where in a given pod, one container is latency-critical
Member

i assume this is the case where a pod has a container that uses integral cores and another container that uses fractional?

Contributor Author

Yes, that's how the static policy as written satisfies this requirement. Does that need clarification here?

Member

its probably ok, i have too much history in the discussion that as phrased here, it signals something deeper.

and another is not (e.g. auxiliary side-car containers responsible for
log forwarding, metrics collection and the like.)
1. Do not cap CPU quota for guaranteed containers that are granted
exclusive cores, since that would be antithetical to (1) above.
1. Take physical processor topology into account in the CPU affinity policy.
Contributor

nit: incorrect numbering.

Contributor

This is github markdown. When viewed as markdown, not source, the numbering is correct.


### Related issues

* Feature: [Further differentiate performance characteristics associated
with pod level QoS](https://github.com/kubernetes/features/issues/276)

## Proposed changes

### CPU Manager component

The *CPU Manager* is a new software component in Kubelet responsible for
assigning pod containers to sets of CPUs on the local node. In later
phases, the scope will expand to include caches, a critical shared
processor resource.

The CPU manager interacts directly with the kuberuntime. The CPU Manager
is notified when containers come and go, before delegating container
creation via the container runtime interface and after the container's
destruction respectively. The CPU Manager emits CPU settings for
containers in response.

#### Discovering CPU topology

The CPU Manager must understand basic topology. First of all, it must
determine the number of logical CPUs (hardware threads) available for
allocation. On architectures that support [hyper-threading][ht], sibling
threads share a number of hardware resources including the cache
hierarchy. On multi-socket systems, logical CPUs co-resident on a socket
share L3 cache. Although there may be some programs that benefit from
disjoint caches, the policies described in this proposal assume cache
affinity will yield better application and overall system performance for
most cases. In all scenarios described below, we prefer to acquire logical
CPUs topologically. For example, allocating two CPUs on a system that has
hyper-threading turned on yields both sibling threads on the same
physical core. Likewise, allocating two CPUs on a non-hyper-threaded
system yields two cores on the same socket.
@teferi teferi Jun 7, 2017

What should happen if there are enough CPUs on a node, but they're far away from each other (I'm requesting 2 CPUs, but the ones left are on different cores)? For some workloads admitting such a Pod may be acceptable, for some not. Would it be a good idea to make this behaviour configurable?

Contributor Author

For now, our plan is to have the first policy (static) provide the "best" allocation when assigning dedicated cores. So, at higher G pod density it would be possible to observe the topology fragmentation you described.

If it were configurable in the policy, do you have ideas about how you would like that to look (name for the knob, values, defaults etc.)?


The more I think about it — the more I realize how complicated things actually are =/
Generally my perception is that by default the CPUs of a single container inside a Pod should be as close to each other as possible, while the CPUs of different containers inside a Pod should be as far as possible from each other. The only case I can imagine when I would want to pack as many dedicated containers on a single socket as possible is when I would want to launch multiple small (1-CPU) containers and then launch a single giant one, that asks for half the CPUs of a core.

For a container if we want to allow to control how close the CPUs are, I imagine it might be a double parameter in the pod spec. Smth like:

corePolicy:
  placement: dense|separate
  strict: true|false   

with dense and false as defaults. separate would mean to attempt to spread CPUs as much as possible and dense would mean to pack as close as possible. With strict: false meaning that workload can tolerate if the requirements are not met. And strict: true on the other side should mean that the container has to be rescheduled.

And then there is a whole new level with HT enabled. I can imagine smth like:

coreThreadPolicy:
  placement: dense|separate|separate_restrict
  strict: true|false

with dense and false as defaults. dense here would mean allow giving a container multiple sibling CPUs. separate would mean attempt to spread CPUs across non-siblings. separate_restrict would mean spread CPUs across non-siblings and reserve the sibling, so that it is not assigned to anyone.
I'm afraid that I might be over-thinking the whole problem =)

As a small disclaimer I'd like to mention that I come from OpenStack, so my perception might be tainted with how OS does it =) Here is a link to a doc that describes how OS approaches similar problems.
(cc @ashish-billore)


@teferi I like the idea of corePolicy.
As for coreThreadPolicy - isn't it a little bit too much ? i.e. not all environments are homogeneous, some nodes may not have HT enabled (probably not much). Besides separate_restrict can be accomplished by requesting i.e. 2 cores. If node has HT and

corePolicy:
  placement: dense
  strict: true

then it would result in 2 CPUs on 1 physical core.


@flyingcougar separate_restrict is more about not allowing others to use core's sibling. It probably would be very difficult to implement, since k8s would still think that we have those cores available for scheduling. So anyway it was probably a weird idea.

corePolicy is about packing the CPUs as tightly as possible on cores in the same socket, while coreThreadPolicy is about packing CPUs as tightly as possible within one core. So imo both of those have their uses. I think I need to draw a couple of diagrams to illustrate these ideas...


cc @fabiand I guess you should participate in ^^ part of the discussion


The more I think about it — the more I realize how complicated things actually are =/

Yes :)

I'm not an expert on this domain, but …
It's pretty clear that we - kubevirt - need the 'strict dense placement' option as a first step.
I actually wonder if additional differentiations are needed: Either you care, and you require an optimal placement, or you don't and get best effort.

Some workloads, like kubevirt (again), might want to manage parts of numa nodes themselves, in a more fine-grained fashion. In those cases it might make sense to be able to tell the kubelet that some nodes, sockets, or cores should be reserved/exclusive to a specific container. A process can then take those reserved parts and manage them as needed (i.e. in our case, libvirt would take care of managing those nodes).
Long story short, we might want a mechanism to lend nodes to a process.

Another thing is that in future we need - and thus should already think of - to combine the placement with ResourceClasses #782, because in reality you want a process close to the device it uses, to avoid numa-node remote memory access.
Closely related to this is actually also the relationship to huge-pages.

I do not think that we need to solve all of this now, but we should understand the picture, to not create something which will already contradict with what we see coming up.

@berrange @mpolednik thoughts?

Contributor Author

@ConnorDoyle ConnorDoyle Jul 11, 2017

@fabiand there's exactly that topic on the resource management workgroup agenda for this morning. Besides the device binding logic and the hugetlb controller settings, CNI plugins could also take topology into account on a multi-socket system with multiple network interfaces. I hope we can achieve agreement that a solution is needed to unify affinity somehow in the v1.9 or v1.10 releases.

On the topic of the CPU manager itself, policies that operators want can get really complicated. Hopefully we can get a good default scoped and implemented first and then extend with more advanced policies. In the resource management face-to-face where this component was conceived, we had a pretty long discussion about implementing a more flexible policy driven by explicit pool configuration. Maybe we could think of ways to represent the behaviors needed for the advanced cases in the config domain. In my mind, a rich configurable policy may buy enough flexibility so that we don't need another plugin system for external policies. At the same time, for v1.8 at least, changes to the pod spec API are probably out of scope.


##### Options for discovering topology

1. Read and parse the virtual file [`/proc/cpuinfo`][procfs] and construct a
convenient data structure.
1. Execute a simple program like `lscpu -p` in a subprocess and construct a
convenient data structure based on the output. Here is an example of
[data structure to represent CPU topology][topo] in go. The linked package
contains code to build a ThreadSet from the output of `lscpu -p`.
1. Execute a mature external topology program like [`mpi-hwloc`][hwloc] --
potentially adding support for the hwloc file format to the Kubelet.
1. Re-use existing discovery functionality from cAdvisor. **(preferred initial
solution)**
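
As an illustration of option 2, here is a minimal sketch of turning `lscpu -p` style output into a sibling map. It assumes the first three CSV columns are CPU, Core and Socket (the default column order) and parses a hand-written sample string rather than spawning a subprocess. `Thread`, `parseLscpu` and `siblings` are hypothetical names, not the API of the linked package or of cAdvisor.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// Thread describes one logical CPU (hypothetical type for this sketch).
type Thread struct {
	CPU, Core, Socket int
}

// parseLscpu reads CSV lines in the style of `lscpu -p`. Lines starting
// with '#' are comments; the first three fields are assumed to be the
// logical CPU id, the core id and the socket id.
func parseLscpu(out string) ([]Thread, error) {
	var threads []Thread
	sc := bufio.NewScanner(strings.NewReader(out))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		fields := strings.Split(line, ",")
		if len(fields) < 3 {
			return nil, fmt.Errorf("unexpected line %q", line)
		}
		var ids [3]int
		for i := range ids {
			n, err := strconv.Atoi(fields[i])
			if err != nil {
				return nil, fmt.Errorf("unexpected line %q: %v", line, err)
			}
			ids[i] = n
		}
		threads = append(threads, Thread{CPU: ids[0], Core: ids[1], Socket: ids[2]})
	}
	return threads, sc.Err()
}

// siblings groups logical CPUs by their (socket, core) pair, which is the
// "HT-to-physical-core map" a topology-aware policy needs.
func siblings(threads []Thread) map[[2]int][]int {
	byCore := map[[2]int][]int{}
	for _, t := range threads {
		key := [2]int{t.Socket, t.Core}
		byCore[key] = append(byCore[key], t.CPU)
	}
	return byCore
}

func main() {
	// Hand-written sample: one socket, two cores, two threads per core.
	sample := "# CPU,Core,Socket,Node\n0,0,0,0\n1,1,0,0\n2,0,0,0\n3,1,0,0\n"
	threads, err := parseLscpu(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(siblings(threads))
}
```

Keying the map on the (socket, core) pair avoids assuming that core ids are unique across sockets.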

#### CPU Manager interfaces (sketch)

```go
// State tracks the CPUs assigned to each container and the default
// (shared) CPU set.
type State interface {
	GetCPUSet(containerID string) (CPUSet, bool)
	GetDefaultCPUSet() CPUSet
	GetCPUSetOrDefault(containerID string) CPUSet
	SetCPUSet(containerID string, cpus CPUSet)
	SetDefaultCPUSet(cpus CPUSet)
	Delete(containerID string)
}

// Manager is notified as containers come and go. Pod and Container stand
// for the Kubernetes API types.
type Manager interface {
	Start()
	Policy() Policy
	RegisterContainer(p *Pod, c *Container, containerID string) error
	UnregisterContainer(containerID string) error
	// Reader is not defined in this sketch; presumably the read-only
	// subset of State.
	State() Reader
}

// Policy decides which CPUs each container may run on.
type Policy interface {
	Name() string
	Start(s State)
	RegisterContainer(s State, pod *Pod, container *Container, containerID string) error
	UnregisterContainer(s State, containerID string) error
}

type CPUSet map[int]struct{} // set operations and parsing/formatting helpers

// type CPUTopology: TBD
```
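
The "set operations and parsing/formatting helpers" mentioned for `CPUSet` are not spelled out in the sketch. The following is one possible shape for them, assuming the Linux CPU list format (for example `0-2,4-6`, as shown by `Cpus_allowed_list` in `/proc/self/status`). The function and method names are illustrative assumptions, not the proposal's API.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// CPUSet mirrors the type in the sketch: a set of logical CPU ids.
type CPUSet map[int]struct{}

// ParseCPUSet parses the Linux list format, e.g. "0-2,4-6".
func ParseCPUSet(s string) (CPUSet, error) {
	set := CPUSet{}
	s = strings.TrimSpace(s)
	if s == "" {
		return set, nil
	}
	for _, part := range strings.Split(s, ",") {
		bounds := strings.SplitN(part, "-", 2)
		first, err := strconv.Atoi(bounds[0])
		if err != nil {
			return nil, fmt.Errorf("bad CPU id in %q: %v", part, err)
		}
		last := first
		if len(bounds) == 2 {
			if last, err = strconv.Atoi(bounds[1]); err != nil {
				return nil, fmt.Errorf("bad CPU id in %q: %v", part, err)
			}
		}
		if last < first {
			return nil, fmt.Errorf("descending range %q", part)
		}
		for id := first; id <= last; id++ {
			set[id] = struct{}{}
		}
	}
	return set, nil
}

// Difference returns the CPUs in s that are not in other.
func (s CPUSet) Difference(other CPUSet) CPUSet {
	out := CPUSet{}
	for id := range s {
		if _, taken := other[id]; !taken {
			out[id] = struct{}{}
		}
	}
	return out
}

// String formats the set back into the list format, merging runs of
// consecutive ids into ranges.
func (s CPUSet) String() string {
	ids := make([]int, 0, len(s))
	for id := range s {
		ids = append(ids, id)
	}
	sort.Ints(ids)
	var parts []string
	for i := 0; i < len(ids); {
		j := i
		for j+1 < len(ids) && ids[j+1] == ids[j]+1 {
			j++
		}
		if i == j {
			parts = append(parts, strconv.Itoa(ids[i]))
		} else {
			parts = append(parts, fmt.Sprintf("%d-%d", ids[i], ids[j]))
		}
		i = j + 1
	}
	return strings.Join(parts, ",")
}

func main() {
	all, err := ParseCPUSet("0-7")
	if err != nil {
		panic(err)
	}
	exclusive, err := ParseCPUSet("3,7")
	if err != nil {
		panic(err)
	}
	// Removing exclusively assigned CPUs leaves the shared pool.
	fmt.Println(all.Difference(exclusive)) // prints 0-2,4-6
}
```

`Difference` is the operation the static policy needs when it shrinks the shared pool to make room for exclusive CPUs.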

Kubernetes will ship with three CPU manager policies. Only one policy is
active at a time on a given node, chosen by the operator via Kubelet
configuration. The three policies are **no-op**, **static** and **dynamic**.
Each policy is described below.

#### Policy 1: "no-op" cpuset control [default]

This policy preserves the existing Kubelet behavior of doing nothing
with the cgroup `cpuset.cpus` and `cpuset.mems` controls. This “no-op”
policy would become the default CPU Manager policy until the effects of
the other policies are better understood.

#### Policy 2: "static" cpuset control

The "static" policy allocates exclusive CPUs for containers if they are
included in a pod of "Guaranteed" [QoS class][qos] and the container's
resource limit for the CPU resource is an integer greater than or
equal to one.

When exclusive CPUs are allocated for a container, those CPUs are
removed from the allowed CPUs of every other container running on the
node. Once allocated at pod admission time, an exclusive CPU remains
assigned to a single container for the lifetime of the pod (until it
becomes terminal.)

##### Implementation sketch

```go
func (p *staticPolicy) Start(s State) {
	// Iteration starts at index `1` here because CPU `0` is reserved
	// for infrastructure processes.
	// TODO(CD): Improve this to align with kube/system reserved resources.
	shared := NewCPUSet()
	for cpuid := 1; cpuid < p.topology.NumCPUs; cpuid++ {
		shared.Add(cpuid)
	}
	s.SetDefaultCPUSet(shared)
}

func (p *staticPolicy) RegisterContainer(s State, pod *Pod, container *Container, containerID string) error {
	if numCPUs := numGuaranteedCPUs(pod, container); numCPUs != 0 {
		// container should get some exclusively allocated CPUs
		cpuset, err := p.allocateCPUs(s, numCPUs)
		if err != nil {
			return err
		}
		s.SetCPUSet(containerID, cpuset)
	}
	// otherwise the container belongs in the shared pool (nothing to do;
	// use default cpuset)
	return nil
}

func (p *staticPolicy) UnregisterContainer(s State, containerID string) error {
	if toRelease, ok := s.GetCPUSet(containerID); ok {
		s.Delete(containerID)
		p.releaseCPUs(s, toRelease)
	}
	return nil
}
```
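
The sketch above calls `allocateCPUs` without defining it. The following standalone example shows one simple way the choice could prefer sibling threads on the same physical core. The type `cpuInfo`, the function `takeCPUs` and the greedy ordering (cores with the most free threads first) are assumptions made for illustration. A real policy would also need to weigh socket locality and avoid fragmenting fully free cores for single-CPU requests.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// cpuInfo locates a logical CPU on a physical core and socket
// (hypothetical type for this sketch).
type cpuInfo struct {
	Core, Socket int
}

// takeCPUs picks n CPUs from free, visiting physical cores with the most
// free threads first so that sibling threads are handed out together.
func takeCPUs(topo map[int]cpuInfo, free []int, n int) ([]int, error) {
	if n > len(free) {
		return nil, errors.New("not enough free CPUs")
	}
	// Group the free CPUs by the physical core they belong to.
	byCore := map[cpuInfo][]int{}
	for _, cpu := range free {
		byCore[topo[cpu]] = append(byCore[topo[cpu]], cpu)
	}
	cores := make([]cpuInfo, 0, len(byCore))
	for c := range byCore {
		cores = append(cores, c)
	}
	// Most free threads first; socket and core id break ties so the
	// result is deterministic.
	sort.Slice(cores, func(i, j int) bool {
		a, b := cores[i], cores[j]
		if len(byCore[a]) != len(byCore[b]) {
			return len(byCore[a]) > len(byCore[b])
		}
		if a.Socket != b.Socket {
			return a.Socket < b.Socket
		}
		return a.Core < b.Core
	})
	picked := make([]int, 0, n)
	for _, c := range cores {
		for _, cpu := range byCore[c] {
			if len(picked) == n {
				break
			}
			picked = append(picked, cpu)
		}
	}
	sort.Ints(picked)
	return picked, nil
}

func main() {
	// One socket, two hyper-threaded cores: CPUs 0 and 2 are siblings,
	// as are CPUs 1 and 3. CPU 0 is held back for infrastructure.
	topo := map[int]cpuInfo{
		0: {Core: 0, Socket: 0},
		1: {Core: 1, Socket: 0},
		2: {Core: 0, Socket: 0},
		3: {Core: 1, Socket: 0},
	}
	picked, err := takeCPUs(topo, []int{1, 2, 3}, 2)
	if err != nil {
		panic(err)
	}
	fmt.Println(picked) // prints [1 3]
}
```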

##### Example pod specs and interpretation

| Pod | Interpretation |
| ------------------------------------------ | ------------------------------ |
| Pod [Guaranteed]:<br />&emsp;A:<br />&emsp;&emsp;cpu: 0.5 | Container **A** is assigned to the shared cpuset. |
| Pod [Guaranteed]:<br />&emsp;A:<br />&emsp;&emsp;cpu: 2.0 | Container **A** is assigned two sibling threads on the same physical core (HT) or two physical cores on the same socket (no HT.)<br /><br /> The shared cpuset is shrunk to make room for the exclusively allocated CPUs. |
| Pod [Guaranteed]:<br />&emsp;A:<br />&emsp;&emsp;cpu: 1.0<br />&emsp;B:<br />&emsp;&emsp;cpu: 0.5 | Container **A** is assigned one exclusive CPU and container **B** is assigned to the shared cpuset. |
| Pod [Guaranteed]:<br />&emsp;A:<br />&emsp;&emsp;cpu: 1.5<br />&emsp;B:<br />&emsp;&emsp;cpu: 0.5 | Both containers **A** and **B** are assigned to the shared cpuset. |
| Pod [Burstable] | All containers are assigned to the shared cpuset. |
| Pod [BestEffort] | All containers are assigned to the shared cpuset. |
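
The rule behind the table can be sketched as a small function. This is an illustration of the stated eligibility rule only: `exclusiveCPUs` and `QoSClass` are hypothetical names, and the CPU limit is assumed to be expressed in millicores (1000 per CPU).

```go
package main

import "fmt"

// QoSClass is a stand-in for the pod QoS class (hypothetical type).
type QoSClass string

const (
	Guaranteed QoSClass = "Guaranteed"
	Burstable  QoSClass = "Burstable"
	BestEffort QoSClass = "BestEffort"
)

// exclusiveCPUs returns how many exclusive CPUs the static policy would
// grant a container: the CPU limit if the pod is Guaranteed and the limit
// is a whole number of CPUs >= 1, otherwise 0 (shared cpuset).
func exclusiveCPUs(qos QoSClass, cpuLimitMilli int64) int {
	if qos != Guaranteed {
		return 0
	}
	if cpuLimitMilli < 1000 || cpuLimitMilli%1000 != 0 {
		return 0
	}
	return int(cpuLimitMilli / 1000)
}

func main() {
	cases := []struct {
		name  string
		qos   QoSClass
		milli int64
	}{
		{"Guaranteed, cpu: 0.5", Guaranteed, 500},
		{"Guaranteed, cpu: 2.0", Guaranteed, 2000},
		{"Guaranteed, cpu: 1.5", Guaranteed, 1500},
		{"Burstable, cpu: 2.0", Burstable, 2000},
	}
	for _, c := range cases {
		fmt.Printf("%-22s -> %d exclusive CPUs\n", c.name, exclusiveCPUs(c.qos, c.milli))
	}
}
```

A result of 0 means the container stays in the shared cpuset, which matches the fractional, Burstable and BestEffort rows above.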

#### Policy 3: "dynamic" cpuset control

_TODO: Describe the policy._
Member

summarize what was discussed at f2f, and just state that this is planned for prototyping post kube 1.8.


##### Implementation sketch

```go
func (p *dynamicPolicy) Start(s State) {
	// TODO
}

func (p *dynamicPolicy) RegisterContainer(s State, pod *Pod, container *Container, containerID string) error {
	// TODO
	return nil
}

func (p *dynamicPolicy) UnregisterContainer(s State, containerID string) error {
	// TODO
	return nil
}
```

##### Example pod specs and interpretation

| Pod | Interpretation |
| ------------------------------------------ | ------------------------------ |
| | |
| | |

## Operations and observability

* Checkpointing assignments
* The CPU Manager must be able to pick up where it left off in case the
Kubelet restarts for any reason.
* Read effective CPU assinments at runtime for alerting. This could be
Member

assignments

satisfied by the checkpointing requirement.
* Configuration
* How does the CPU Manager coexist with existing kube-reserved
settings?
* How does the CPU Manager coexist with related Linux kernel
configuration (e.g. `isolcpus`)? The operator may want to specify a
low-water-mark for the size of the shared cpuset. The operator may
Contributor Author

In the PoC branch we have the static policy unconditionally reserving a core for infra processes, with banning BE if the shared pool becomes empty as a TODO. However, containers in the Burstable class also might lack cpu requests (a single-container pod that requires only memory is in the Burstable class.)

@sjenning what do you think about replacing the notion of infra-reserved cores with a min size for the shared pool in the static policy? One thing I wouldn't want is to effectively disable a significant fraction of available threads on small systems (e.g. 4 or 8 cores.)

Contributor

@ConnorDoyle I was thinking we would remove ceiling(kube-reserved cpu + system-reserved cpu) cores from the initial cpu pool with core 0 never being in the pool. We do allow the shared pool to go to zero, at which point we evict BE pod and Burstable pods with no CPU requests, and set a CPUPressure condition true on the node that prevents the scheduler from assigning pods that don't have a cpu request.

But you are talking about the case where the user wants to ensure that the shared pool does not go to zero. I guess that value can't be implied by any tunable the kubelet exposes today. That might have to be a new one.

Does that make sense? (not rhetorical) Did I address your question? My brain has been everywhere lately.

want to correlate exclusive cores with the isolated CPUs, in which
case the strategy outlined above where allocations are taken
directly from the shared pool is too simplistic. We could allow an
explicit pool of cores that may be exclusively allocated and default
this to the shared pool (leaving at least one core fro the shared


s/fro/for

cpuset to be used for OS, infra and non-exclusive containers).
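
One option raised in review for the kube-reserved question is to remove ceiling(kube-reserved cpu + system-reserved cpu) CPUs from the initial pool, with CPU 0 never in the pool. A minimal sketch of that arithmetic follows. It assumes reservations are given in millicores and that the reserved CPUs are the lowest-numbered ones; `initialSharedPool` is a hypothetical helper, not an existing Kubelet function.

```go
package main

import "fmt"

// initialSharedPool returns the CPU ids left for the shared pool after
// holding back ceil((kubeReserved + systemReserved) / 1000) CPUs.
// At least one CPU (CPU 0) is always held back.
func initialSharedPool(numCPUs int, kubeReservedMilli, systemReservedMilli int64) []int {
	// Integer ceiling of the combined reservation, in whole CPUs.
	reserved := int((kubeReservedMilli + systemReservedMilli + 999) / 1000)
	if reserved < 1 {
		reserved = 1
	}
	pool := []int{}
	for cpu := reserved; cpu < numCPUs; cpu++ {
		pool = append(pool, cpu)
	}
	return pool
}

func main() {
	// 500m kube-reserved + 700m system-reserved rounds up to 2 CPUs.
	fmt.Println(initialSharedPool(8, 500, 700)) // prints [2 3 4 5 6 7]
}
```

A configurable minimum size for the shared pool, also discussed in review, would be a separate check at allocation time and is not shown here.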

## Practical challenges

1. Synchronizing CPU Manager state with the container runtime via the
CRI. Runc/libcontainer allows container cgroup settings to be updtaed


s/updtaed/updated

after creation, but neither the Kubelet docker shim nor the CRI
implement a similar interface.
1. Mitigation: [PR 46105](https://github.com/kubernetes/kubernetes/pull/46105)
Member

this is needed independent of cpu policy.

Contributor Author

Any update required in this text? (what are the other things that need it, VPA?)


## Implementation roadmap

### Phase 1: No-op policy
Member

make clear kube 1.8 targets phase 1 and 2.


* Internal API exists to allocate CPUs to containers
([PR 46105](https://github.com/kubernetes/kubernetes/pull/46105))
* Kubelet configuration includes a CPU manager policy (initially only no-op)
* No-op policy is implemented.
* All existing unit and e2e tests pass.
* Initial unit tests pass.

### Phase 2: Static policy

* Kubelet can discover "basic" CPU topology (HT-to-physical-core map)
* Static policy is implemented.
* Unit tests for static policy pass.
* e2e tests for static policy pass.
* Performance metrics for one or more plausible synthetic workloads show
benefit over no-op policy.

### Phase 3: Cache allocation

* Static policy also manages [cache allocation][cat] on supported platforms.

### Phase 4: Dynamic policy

* Dynamic policy is implemented.
* Unit tests for dynamic policy pass.
* e2e tests for dynamic policy pass.
* Performance metrics for one or more plausible synthetic workloads show
benefit over no-op policy.

### Phase 5: NUMA

* Kubelet can discover "advanced" CPU topology (NUMA).

## Appendix A: cpuset pitfalls

1. `cpuset.sched_relax_domain_level`


Might be useful to add a link and a gist of the effect.

1. Child cpusets must be subsets of their parents. If B is a child of A,
then B must be a subset of A. Attempting to shrink A such that B
would contain allowed CPUs not in A is not allowed (the write will
fail.) Nested cpusets must be shrunk bottom-up. By the same rationale,
nested cpusets must be expanded top-down.
1. Dynamically changing cpusets by directly writing to the sysfs would
create inconsistencies with container runtimes.
1. The `exclusive` flag. This will not be used. We will achieve
exclusivity for a CPU by removing it from all other assigned cpusets.
1. Tricky semantics when cpusets are combined with CFS shares and quota.
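
The ordering rule for nested cpusets can be demonstrated with a small simulation. The `cgroup` type below is a toy model that only enforces the subset invariant described in this appendix; it does not touch the real cgroup filesystem, and all names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

type cpuSet map[int]bool

func newCPUs(ids ...int) cpuSet {
	s := cpuSet{}
	for _, id := range ids {
		s[id] = true
	}
	return s
}

func (a cpuSet) subsetOf(b cpuSet) bool {
	for id := range a {
		if !b[id] {
			return false
		}
	}
	return true
}

// cgroup is a toy cpuset node that rejects writes breaking the rule
// "a child cpuset must be a subset of its parent".
type cgroup struct {
	allowed  cpuSet
	parent   *cgroup
	children []*cgroup
}

func (g *cgroup) write(next cpuSet) error {
	if g.parent != nil && !next.subsetOf(g.parent.allowed) {
		return errors.New("rejected: CPUs outside the parent cpuset")
	}
	for _, c := range g.children {
		if !c.allowed.subsetOf(next) {
			return errors.New("rejected: a child cpuset would no longer fit")
		}
	}
	g.allowed = next
	return nil
}

func main() {
	parent := &cgroup{allowed: newCPUs(0, 1, 2, 3)}
	child := &cgroup{allowed: newCPUs(2, 3), parent: parent}
	parent.children = append(parent.children, child)

	// Goal: shrink the parent from 0-3 to 0-1 and the child from 2-3 to 1.
	fmt.Println("parent first:", parent.write(newCPUs(0, 1)))
	fmt.Println("child first: ", child.write(newCPUs(1)))
	fmt.Println("then parent: ", parent.write(newCPUs(0, 1)))
}
```

Shrinking the parent first is rejected because the child still holds CPUs 2 and 3. Shrinking the child and then the parent succeeds, which is the bottom-up order the list calls for.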

[cat]: http://www.intel.com/content/www/us/en/communications/cache-monitoring-cache-allocation-technologies.html
[ht]: http://www.intel.com/content/www/us/en/architecture-and-technology/hyper-threading/hyper-threading-technology.html
[hwloc]: https://www.open-mpi.org/projects/hwloc
[procfs]: http://man7.org/linux/man-pages/man5/proc.5.html
[qos]: https://github.com/kubernetes/community/blob/master/contributors/design-proposals/resource-qos.md
[topo]: http://github.com/intelsdi-x/swan/tree/master/pkg/isolation/topo