Add CPU manager proposal. #654
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,366 @@ | ||
# CPU Manager | ||
|
||
_Authors:_ | ||
|
||
* @ConnorDoyle - Connor Doyle <[email protected]> | ||
* @flyingcougar - Szymon Scharmach <[email protected]> | ||
* @sjenning - Seth Jennings <[email protected]> | ||
|
||
**Contents:** | ||
|
||
* [Overview](#overview) | ||
* [Proposed changes](#proposed-changes) | ||
* [Operations and observability](#operations-and-observability) | ||
* [Practical challenges](#practical-challenges) | ||
* [Implementation roadmap](#implementation-roadmap) | ||
* [Appendix A: cpuset pitfalls](#appendix-a-cpuset-pitfalls) | ||
|
||
## Overview | ||
|
||
_Problems to solve:_ | ||
|
||
1. Poor or unpredictable performance observed compared to virtual machine | ||
based orchestration systems. Applications see higher latency and lower | ||
CPU throughput than on VMs because CPU quota is fulfilled across all | ||
cores rather than from exclusive cores; exclusive cores would result in | ||
fewer context switches and higher cache affinity. | ||
1. Unacceptable latency attributed to the OS process scheduler, especially | ||
for “fast” virtual network functions that aim to approach line rate on | ||
modern server NICs. | ||
|
||
_Solution requirements:_ | ||
|
||
1. Provide an API-driven contract from the system to a user: "if you are a | ||
Guaranteed pod with 1 or more cores of cpu, the system will try to make | ||
sure that the pod gets its cpu quota primarily from reserved core(s), | ||
resulting in fewer context switches and higher cache affinity". | ||
1. Support the case where in a given pod, one container is latency-critical | ||
Review comment: i assume this is the case where a pod has a container that uses integral cores and another container that uses fractional?
Review comment: Yes, that's how the static policy as written satisfies this requirement. Does that need clarification here?
Review comment: its probably ok, i have too much history in the discussion that as phrased here, it signals something deeper. |
||
and another is not (e.g. auxiliary side-car containers responsible for | ||
log forwarding, metrics collection and the like.) | ||
1. Do not cap CPU quota for guaranteed containers that are granted | ||
exclusive cores, since that would be antithetical to (1) above. | ||
1. Take physical processor topology into account in the CPU affinity policy. | ||
Review comment: nit: incorrect numbering.
Review comment: This is github markdown. When viewed as markdown, not source, the numbering is correct. |
||
|
||
### Related issues | ||
|
||
* Feature: [Further differentiate performance characteristics associated | ||
with pod level QoS](https://github.com/kubernetes/features/issues/276) | ||
|
||
## Proposed changes | ||
|
||
### CPU Manager component | ||
|
||
The *CPU Manager* is a new software component in Kubelet responsible for | ||
assigning pod containers to sets of CPUs on the local node. In later | ||
phases, the scope will expand to include caches, a critical shared | ||
processor resource. | ||
|
||
The kuberuntime notifies the CPU manager when containers come and | ||
go. The first such notification occurs in between the container runtime | ||
interface calls to create and start the container. The second notification | ||
occurs after the container is destroyed by the container runtime. The CPU | ||
Manager writes CPU settings for containers using a new CRI method named | ||
[`UpdateContainerResources`](https://github.com/kubernetes/kubernetes/pull/46105). | ||
This new method is invoked from two places in the CPU manager: during each | ||
call to `RegisterContainer` and also periodically from a separate | ||
reconciliation loop. | ||
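The reconciliation pass described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the Kubelet implementation: `assignments`, `updateFunc`, and `reconcileOnce` are hypothetical names, cpusets are represented as plain strings, and the real `UpdateContainerResources` CRI call is stood in for by a callback.

```go
package main

import "fmt"

// assignments is a hypothetical stand-in for the CPU manager state:
// container ID -> cpuset string (for example "2-3").
type assignments struct {
	exclusive  map[string]string
	defaultSet string
}

// cpusetOrDefault mirrors the GetCPUSetOrDefault idea from the proposal.
func (a *assignments) cpusetOrDefault(containerID string) string {
	if cs, ok := a.exclusive[containerID]; ok {
		return cs
	}
	return a.defaultSet
}

// updateFunc is a placeholder for the CRI UpdateContainerResources call.
type updateFunc func(containerID, cpus string) error

// reconcileOnce pushes the desired cpuset to every running container and
// returns the IDs that failed, so that a later pass can retry them.
func reconcileOnce(a *assignments, running []string, update updateFunc) []string {
	var failed []string
	for _, id := range running {
		if err := update(id, a.cpusetOrDefault(id)); err != nil {
			failed = append(failed, id)
		}
	}
	return failed
}

func main() {
	a := &assignments{exclusive: map[string]string{"c1": "2-3"}, defaultSet: "0-1"}
	applied := map[string]string{}
	failed := reconcileOnce(a, []string{"c1", "c2"}, func(id, cpus string) error {
		applied[id] = cpus
		return nil
	})
	fmt.Println(applied, failed)
}
```

In a real loop this pass would presumably run on a timer; the period and retry behavior are not specified by the proposal.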
|
||
![cpu-manager-block-diagram](https://user-images.githubusercontent.com/379372/28443427-bf1b2972-6d6a-11e7-8acb-6cbe9013ac28.png) | ||
|
||
_CPU Manager block diagram. `Policy`, `State`, and `Topology` types are | ||
factored out of the CPU Manager to promote reuse and to make it easier | ||
to build and test new policies. The shared state abstraction allows | ||
other Kubelet components to be agnostic of the CPU manager policy for | ||
observability and checkpointing extensions._ | ||
|
||
#### Discovering CPU topology | ||
|
||
The CPU Manager must understand basic topology. First of all, it must | ||
determine the number of logical CPUs (hardware threads) available for | ||
allocation. On architectures that support [hyper-threading][ht], sibling | ||
threads share a number of hardware resources including the cache | ||
hierarchy. On multi-socket systems, logical CPUs co-resident on a socket | ||
share L3 cache. Although there may be some programs that benefit from | ||
disjoint caches, the policies described in this proposal assume cache | ||
affinity will yield better application and overall system performance for | ||
most cases. In all scenarios described below, we prefer to acquire logical | ||
CPUs topologically. For example, allocating two CPUs on a system that has | ||
hyper-threading turned on yields both sibling threads on the same | ||
physical core. Likewise, allocating two CPUs on a non-hyper-threaded | ||
system yields two cores on the same socket. | ||
Review comment: What should happen if there are enough CPUs on a node, but they're far away from each other (I'm requesting 2 CPUs, but the ones left are on different cores). For some workloads admitting such Pod may be acceptable, for some not. Would it be a good idea to make this behaviour configurable?
Review comment: For now, our plan is to have the first policy (static) provide the "best" allocation when assigning dedicated cores. So, at higher G pod density it would be possible to observe the topology fragmentation you described. If it were configurable in the policy, do you have ideas about how you would like that to look (name for the knob, values, defaults etc.)?
Review comment: The more I think about it, the more I realize how complicated things actually are =/ For a container if we want to allow to control how close the CPUs are, I imagine it might be a double parameter in the pod spec. Smth like:
with dense and false as defaults. And then there is a whole new level with HT enabled. I can imagine smth like:
with As a small disclaimer I'd like to mention that I come from OpenStack, so my perception might be tainted with how OS does it =) Here is a link to a doc that describes how OS approaches similar problems.
Review comment: @teferi I like the idea of
then it would result in 2 CPUs on 1 physical core.
Review comment: @flyingcougar separate_restrict is more about not allowing others to use core's sibling. It probably would be very difficult to implement, since k8s would still think that we have those cores available for scheduling. So anyway it was probably a weird idea.
Review comment: cc @fabiand I guess you should participate in ^^ part of the discussion
Review comment: Yes :) I'm not an expert on this domain, but … Some workloads, like kubevirt (again), might want to manage parts of numa nodes themselves, in a more fine granular fashion. In those cases it might make sense, to be able to tell the kubelet, that some nodes, sockets, core, should be reserved/exclusive to a specific container. A process can then take those reserved parts and manage them as needed (i.e. in our case, libvirt would take care of managing those nodes). Another thing is that in future we need - and thus should already think of - to combine the placement with ResourceClasses #782, because in reality you want a process close to the device it uses, to avoid numa-node remote memory access. I do not think that we need to solve all of this now, but we should understand the picture, to not create something which will already contradict with what we see coming up. @berrange @mpolednik thoughts?
Review comment: @fabiand there's exactly that topic on the resource management workgroup agenda for this morning. Besides the device binding logic and the hugetlb controller settings, CNI plugins could also take topology into account on a multi-socket system with multiple network interfaces. I hope we can achieve agreement that a solution is needed to unify affinity somehow in the v1.9 or v1.10 releases. On the topic of the CPU manager itself, policies that operators want can get really complicated. Hopefully we can get a good default scoped and implemented first and then extend with more advanced policies. In the resource management face-to-face where this component was conceived, we had a pretty long discussion about implementing a more flexible policy driven by explicit pool configuration. Maybe we could think of ways to represent the behaviors needed for the advanced cases in the config domain.
In my mind, a rich configurable policy may buy enough flexibility so that we don't need another plugin system for external policies. At the same time, for v1.8 at least, changes to the pod spec API are probably out of scope. |
||
|
||
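The topological preference described in this section (sibling threads first, then cores on the same socket) can be illustrated with a naive ordering heuristic. This is only a sketch under stated assumptions: `cpuInfo` and `takeTopologically` are hypothetical names, and the heuristic simply prefers cores with the most free threads; it is not the best-fit search the static policy would need.

```go
package main

import (
	"fmt"
	"sort"
)

// cpuInfo is a hypothetical description of one logical CPU.
type cpuInfo struct {
	ID, Core, Socket int
}

// takeTopologically returns n free logical CPUs, preferring CPUs on cores
// that still have the most free threads, then lower socket, core, and CPU
// IDs. It is a naive ordering heuristic, not a best-fit search.
func takeTopologically(free []cpuInfo, n int) ([]int, error) {
	if n < 0 || n > len(free) {
		return nil, fmt.Errorf("requested %d CPUs, %d free", n, len(free))
	}
	// Count free threads per (socket, core) so that fuller cores sort first.
	type coreKey struct{ socket, core int }
	freeThreads := map[coreKey]int{}
	for _, c := range free {
		freeThreads[coreKey{c.Socket, c.Core}]++
	}
	sorted := append([]cpuInfo(nil), free...)
	sort.Slice(sorted, func(i, j int) bool {
		a, b := sorted[i], sorted[j]
		fa := freeThreads[coreKey{a.Socket, a.Core}]
		fb := freeThreads[coreKey{b.Socket, b.Core}]
		if fa != fb {
			return fa > fb
		}
		if a.Socket != b.Socket {
			return a.Socket < b.Socket
		}
		if a.Core != b.Core {
			return a.Core < b.Core
		}
		return a.ID < b.ID
	})
	ids := make([]int, 0, n)
	for _, c := range sorted[:n] {
		ids = append(ids, c.ID)
	}
	return ids, nil
}

func main() {
	// One socket, two hyper-threaded cores; CPU 2 (core 0) is already taken.
	free := []cpuInfo{
		{ID: 0, Core: 0, Socket: 0},
		{ID: 1, Core: 1, Socket: 0},
		{ID: 3, Core: 1, Socket: 0},
	}
	ids, err := takeTopologically(free, 2)
	fmt.Println(ids, err) // both threads of core 1: [1 3] <nil>
}
```

The fragmentation case raised in the review thread still applies: when no core has enough free threads, this heuristic falls back to CPUs on different cores rather than rejecting the request.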
**Decision:** Initially the CPU Manager will re-use the existing discovery | ||
mechanism in cAdvisor. | ||
|
||
Alternate options considered for discovering topology: | ||
|
||
1. Read and parse the virtual file [`/proc/cpuinfo`][procfs] and construct a | ||
convenient data structure. | ||
1. Execute a simple program like `lscpu -p` in a subprocess and construct a | ||
convenient data structure based on the output. Here is an example of | ||
[data structure to represent CPU topology][topo] in go. The linked package | ||
contains code to build a ThreadSet from the output of `lscpu -p`. | ||
1. Execute a mature external topology program like [`mpi-hwloc`][hwloc] -- | ||
potentially adding support for the hwloc file format to the Kubelet. | ||
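As a rough illustration of the `lscpu -p` alternative, the sketch below parses the parsable output into a small topology structure. It assumes the default column order (`CPU,Core,Socket,...`) and reads the text from a string rather than spawning a subprocess; `logicalCPU`, `parseLscpu`, and `siblingsByCore` are hypothetical names and are not taken from the linked package.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// logicalCPU holds the first three columns of one `lscpu -p` row.
type logicalCPU struct {
	ID, Core, Socket int
}

// parseLscpu parses `lscpu -p` style output, skipping '#' comment lines.
func parseLscpu(output string) ([]logicalCPU, error) {
	var cpus []logicalCPU
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		fields := strings.Split(line, ",")
		if len(fields) < 3 {
			return nil, fmt.Errorf("unexpected line %q", line)
		}
		var vals [3]int
		for i := range vals {
			v, err := strconv.Atoi(fields[i])
			if err != nil {
				return nil, fmt.Errorf("bad field in %q: %v", line, err)
			}
			vals[i] = v
		}
		cpus = append(cpus, logicalCPU{ID: vals[0], Core: vals[1], Socket: vals[2]})
	}
	return cpus, sc.Err()
}

// siblingsByCore groups logical CPU IDs by [socket, core], in input order.
func siblingsByCore(cpus []logicalCPU) map[[2]int][]int {
	groups := map[[2]int][]int{}
	for _, c := range cpus {
		key := [2]int{c.Socket, c.Core}
		groups[key] = append(groups[key], c.ID)
	}
	return groups
}

func main() {
	sample := "# CPU,Core,Socket,Node,,L1d,L1i,L2,L3\n" +
		"0,0,0,0,,0,0,0,0\n1,1,0,0,,1,1,1,0\n" +
		"2,0,0,0,,0,0,0,0\n3,1,0,0,,1,1,1,0\n"
	cpus, err := parseLscpu(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(siblingsByCore(cpus))
}
```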
|
||
#### CPU Manager interfaces (sketch) | ||
|
||
```go | ||
// State stores the per-container CPU assignments and the shared (default) pool. | ||
type State interface { | ||
	GetCPUSet(containerID string) (cpuset.CPUSet, bool) | ||
	GetDefaultCPUSet() cpuset.CPUSet | ||
	GetCPUSetOrDefault(containerID string) cpuset.CPUSet | ||
	SetCPUSet(containerID string, cset cpuset.CPUSet) | ||
	SetDefaultCPUSet(cset cpuset.CPUSet) | ||
	Delete(containerID string) | ||
} | ||
|
||
// Manager is the entry point used by the kuberuntime. | ||
type Manager interface { | ||
	Start() | ||
	RegisterContainer(p *Pod, c *Container, containerID string) error | ||
	UnregisterContainer(containerID string) error | ||
	State() state.Reader | ||
} | ||
|
||
// Policy decides how CPUs are assigned; it mutates the State it is given. | ||
type Policy interface { | ||
	Name() string | ||
	Start(s State) | ||
	RegisterContainer(s State, pod *Pod, container *Container, containerID string) error | ||
	UnregisterContainer(s State, containerID string) error | ||
} | ||
|
||
// CPUSet lives in package cpuset. | ||
type CPUSet map[int]struct{} // set operations and parsing/formatting helpers | ||
|
||
type CPUTopology struct{} // TBD | ||
``` | ||
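To make the `State` interface more concrete, here is a minimal in-memory sketch. It is an illustration only: `memoryState` and `newMemoryState` are hypothetical names, `CPUSet` is defined locally using the map-based representation from the sketch above, and the type is not safe for concurrent use.

```go
package main

import (
	"fmt"
	"sort"
)

// CPUSet follows the map-based representation from the interface sketch.
type CPUSet map[int]struct{}

// NewCPUSet builds a set from the given logical CPU IDs.
func NewCPUSet(ids ...int) CPUSet {
	s := CPUSet{}
	for _, id := range ids {
		s[id] = struct{}{}
	}
	return s
}

// Difference returns the CPUs in s that are not in other.
func (s CPUSet) Difference(other CPUSet) CPUSet {
	out := CPUSet{}
	for id := range s {
		if _, found := other[id]; !found {
			out[id] = struct{}{}
		}
	}
	return out
}

// Sorted returns the CPU IDs in ascending order.
func (s CPUSet) Sorted() []int {
	ids := make([]int, 0, len(s))
	for id := range s {
		ids = append(ids, id)
	}
	sort.Ints(ids)
	return ids
}

// memoryState is a hypothetical, non-thread-safe State implementation.
type memoryState struct {
	assignments map[string]CPUSet
	defaultSet  CPUSet
}

func newMemoryState() *memoryState {
	return &memoryState{assignments: map[string]CPUSet{}, defaultSet: CPUSet{}}
}

func (m *memoryState) GetCPUSet(containerID string) (CPUSet, bool) {
	cs, ok := m.assignments[containerID]
	return cs, ok
}

func (m *memoryState) GetDefaultCPUSet() CPUSet { return m.defaultSet }

func (m *memoryState) GetCPUSetOrDefault(containerID string) CPUSet {
	if cs, ok := m.GetCPUSet(containerID); ok {
		return cs
	}
	return m.GetDefaultCPUSet()
}

func (m *memoryState) SetCPUSet(containerID string, cs CPUSet) { m.assignments[containerID] = cs }
func (m *memoryState) SetDefaultCPUSet(cs CPUSet)              { m.defaultSet = cs }
func (m *memoryState) Delete(containerID string)               { delete(m.assignments, containerID) }

func main() {
	s := newMemoryState()
	all := NewCPUSet(0, 1, 2, 3)
	exclusive := NewCPUSet(2, 3)
	s.SetCPUSet("container-a", exclusive)
	s.SetDefaultCPUSet(all.Difference(exclusive))
	fmt.Println(s.GetCPUSetOrDefault("container-a").Sorted()) // [2 3]
	fmt.Println(s.GetCPUSetOrDefault("container-b").Sorted()) // [0 1]
}
```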
|
||
#### Configuring the CPU Manager | ||
|
||
Kubernetes will ship with three CPU manager policies. Only one policy is | ||
active at a time on a given node, chosen by the operator via Kubelet | ||
configuration. The three policies are **noop**, **static** and **dynamic**. | ||
Review comment: prefer we s/noop/none
Review comment: mostly because i am not sure how well noop is understood by non-native english speakers. it reads as "nupe"
Review comment: OK to change it. |
||
|
||
The active CPU manager policy is set through a new Kubelet | ||
configuration value `--cpu-manager-policy`. The default value is `noop`. | ||
Review comment: Just recording here for future awareness, but I prefer we not prefix the name of this flag with experimental and instead state similar to node-labels that specific options are Alpha feature. |
||
|
||
The number of CPUs that pods may run on is set using the existing | ||
Review comment: nit: s/is set/can be implicitly controlled/ |
||
node-allocatable configuration settings. See the [node allocatable proposal | ||
document][node-allocatable] for details. The CPU manager will claim | ||
`ceiling(node.status.allocatable.cpu)` as the number of CPUs available to | ||
assign to pods, starting from the highest-numbered physical core and | ||
descending topologically. It is recommended to configure an integer value for | ||
Review comment: @ConnorDoyle Can you please give an example here illustrating with sample topology, |
||
`node.status.allocatable.cpu` when the CPU manager is enabled. | ||
Review comment: Not sure how to say this more precisely, but I think this might be better as the node allocatable CPU is not explicitly set by the user: It is recommended that the user configure integer CPU values for
Review comment: Thanks, let's add exactly that text. It's a big improvement.
Review comment: its really the sum of kube-reserved and system-reserved should be integral. |
||
|
||
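A small worked example of the `ceiling(node.status.allocatable.cpu)` rule may help. The sketch below is an assumption-laden illustration: `claimableCPUs` and `claimFromTop` are hypothetical helpers, allocatable CPU is expressed in millicores, and the topology is treated as flat (one logical CPU per core), which ignores the "descending topologically" detail for hyper-threaded systems.

```go
package main

import "fmt"

// claimableCPUs returns ceiling(allocatable CPU) for a value in millicores.
func claimableCPUs(allocatableMilli int64) int {
	return int((allocatableMilli + 999) / 1000)
}

// claimFromTop lists the CPU IDs claimed for pods, starting from the
// highest-numbered CPU and descending. Assumes a flat topology.
func claimFromTop(numCPUs, claim int) []int {
	if claim < 0 {
		claim = 0
	}
	if claim > numCPUs {
		claim = numCPUs
	}
	ids := make([]int, 0, claim)
	for id := numCPUs - 1; id >= numCPUs-claim; id-- {
		ids = append(ids, id)
	}
	return ids
}

func main() {
	// 4 CPUs with 1000m reserved in total: allocatable is 3000m.
	fmt.Println(claimFromTop(4, claimableCPUs(3000))) // [3 2 1]; CPU 0 is left for reserved work
	// A fractional reservation (500m) rounds the claim up to all 4 CPUs.
	fmt.Println(claimFromTop(4, claimableCPUs(3500))) // [3 2 1 0]
}
```

The second case shows why an integer reservation is recommended: with a fractional value the ceiling leaves no whole CPU outside the pod pool.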
Operator documentation will be updated to explain how to configure the | ||
system to use the low-numbered physical cores for kube-reserved and | ||
system-reserved slices. | ||
Review comment: nit: s/slices/cgroup in case users are not on systemd. |
||
|
||
Review comment: @ConnorDoyle Will it make sense to add minimal details in this document as well for example "user-story"? |
||
Each policy is described below. | ||
|
||
#### Policy 1: "no-op" cpuset control [default] | ||
|
||
This policy preserves the existing Kubelet behavior of doing nothing | ||
with the cgroup `cpuset.cpus` and `cpuset.mems` controls. This “no-op” | ||
policy would become the default CPU Manager policy until the effects of | ||
the other policies are better understood. | ||
|
||
#### Policy 2: "static" cpuset control | ||
|
||
The "static" policy allocates exclusive CPUs for containers if they are | ||
included in a pod of "Guaranteed" [QoS class][qos] and the container's | ||
resource limit for the CPU resource is an integer greater than or | ||
equal to one. | ||
|
||
When exclusive CPUs are allocated for a container, those CPUs are | ||
removed from the allowed CPUs of every other container running on the | ||
node. Once allocated at pod admission time, an exclusive CPU remains | ||
assigned to a single container for the lifetime of the pod (until it | ||
becomes terminal.) | ||
|
||
##### Implementation sketch | ||
|
||
```go | ||
func (p *staticPolicy) Start(s State) { | ||
	// Build the set of all logical CPUs on the node. | ||
	fullCpuset := cpuset.NewCPUSet() | ||
	for cpuid := 0; cpuid < p.topology.NumCPUs; cpuid++ { | ||
		fullCpuset.Add(cpuid) | ||
	} | ||
	// Figure out which cores shall not be used in shared pool | ||
	reserved, _ := takeByTopology(p.topology, fullCpuset, p.topology.NumReservedCores) | ||
	s.SetDefaultCPUSet(fullCpuset.Difference(reserved)) | ||
} | ||
|
||
func (p *staticPolicy) RegisterContainer(s State, pod *Pod, container *Container, containerID string) error { | ||
	if numCPUs := numGuaranteedCPUs(pod, container); numCPUs != 0 { | ||
		// container should get some exclusively allocated CPUs | ||
		// (named cset to avoid shadowing the cpuset package) | ||
		cset, err := p.allocateCPUs(s, numCPUs) | ||
		if err != nil { | ||
			return err | ||
		} | ||
		s.SetCPUSet(containerID, cset) | ||
	} | ||
	// container belongs in the shared pool (nothing to do; use default cpuset) | ||
	return nil | ||
} | ||
|
||
func (p *staticPolicy) UnregisterContainer(s State, containerID string) error { | ||
	// Only containers with exclusive CPUs have an entry to release. | ||
	if toRelease, ok := s.GetCPUSet(containerID); ok { | ||
		s.Delete(containerID) | ||
		p.releaseCPUs(s, toRelease) | ||
	} | ||
	return nil | ||
} | ||
``` | ||
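The sketch above calls `allocateCPUs` and `releaseCPUs` without defining them. One possible shape, shown only as an illustration, is below. Note the swapped-in technique: it takes the lowest-numbered CPUs from the shared pool rather than doing the topological best fit the proposal calls for, and `pool`, `allocate`, and `release` are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
)

// pool is a hypothetical representation of the shared (default) cpuset.
type pool map[int]struct{}

// allocate removes n CPUs from the shared pool and returns them.
// It picks the lowest-numbered CPUs, NOT a topological best fit.
func allocate(shared pool, n int) ([]int, error) {
	if n < 0 || n > len(shared) {
		return nil, fmt.Errorf("cannot allocate %d CPUs from a pool of %d", n, len(shared))
	}
	ids := make([]int, 0, len(shared))
	for id := range shared {
		ids = append(ids, id)
	}
	sort.Ints(ids)
	taken := ids[:n]
	for _, id := range taken {
		delete(shared, id)
	}
	return taken, nil
}

// release returns previously allocated CPUs to the shared pool.
func release(shared pool, cpus []int) {
	for _, id := range cpus {
		shared[id] = struct{}{}
	}
}

func main() {
	shared := pool{0: {}, 1: {}, 2: {}, 3: {}}
	taken, err := allocate(shared, 2)
	fmt.Println(taken, err, len(shared)) // [0 1] <nil> 2
	release(shared, taken)
	fmt.Println(len(shared)) // 4
}
```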
|
||
##### Example pod specs and interpretation | ||
|
||
| Pod | Interpretation | | ||
| ------------------------------------------ | ------------------------------ | | ||
| Pod [Guaranteed]:<br /> A:<br />  cpu: 0.5 | Container **A** is assigned to the shared cpuset. | | ||
| Pod [Guaranteed]:<br /> A:<br />  cpu: 2.0 | Container **A** is assigned two sibling threads on the same physical core (HT) or two physical cores on the same socket (no HT.)<br /><br /> The shared cpuset is shrunk to make room for the exclusively allocated CPUs. | | ||
| Pod [Guaranteed]:<br /> A:<br />  cpu: 1.0<br /> B:<br />  cpu: 0.5 | Container **A** is assigned one exclusive CPU and container **B** is assigned to the shared cpuset. | | ||
| Pod [Guaranteed]:<br /> A:<br />  cpu: 1.5<br /> B:<br />  cpu: 0.5 | Both containers **A** and **B** are assigned to the shared cpuset. | | ||
| Pod [Burstable] | All containers are assigned to the shared cpuset. | | ||
| Pod [BestEffort] | All containers are assigned to the shared cpuset. | | ||
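The eligibility rule behind the table can be written down compactly. The following is a sketch, not the Kubelet code: it uses a simplified signature (a QoS class plus a CPU limit in millicores) instead of the `pod` and `container` arguments in the implementation sketch, and the `qosClass` type and its constants are hypothetical.

```go
package main

import "fmt"

// qosClass is a hypothetical stand-in for the pod QoS class.
type qosClass string

const (
	guaranteed qosClass = "Guaranteed"
	burstable  qosClass = "Burstable"
	bestEffort qosClass = "BestEffort"
)

// numGuaranteedCPUs returns the number of exclusive CPUs a container should
// receive under the static policy: the CPU limit if the pod is Guaranteed and
// the limit is an integer >= 1, otherwise 0 (shared pool).
func numGuaranteedCPUs(qos qosClass, cpuLimitMilli int64) int {
	if qos != guaranteed {
		return 0
	}
	if cpuLimitMilli < 1000 || cpuLimitMilli%1000 != 0 {
		return 0
	}
	return int(cpuLimitMilli / 1000)
}

func main() {
	fmt.Println(numGuaranteedCPUs(guaranteed, 500))  // 0: fractional, shared pool
	fmt.Println(numGuaranteedCPUs(guaranteed, 2000)) // 2: two exclusive CPUs
	fmt.Println(numGuaranteedCPUs(guaranteed, 1500)) // 0: not an integer
	fmt.Println(numGuaranteedCPUs(burstable, 2000))  // 0: wrong QoS class
	fmt.Println(numGuaranteedCPUs(bestEffort, 0))    // 0: wrong QoS class
}
```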
|
||
##### Example scenarios and interactions | ||
|
||
1. _A container arrives that requires exclusive cores._ | ||
1. Kuberuntime calls the CRI delegate to create the container. | ||
1. Kuberuntime registers the container with the CPU manager. | ||
1. CPU manager registers the container to the static policy. | ||
1. Static policy acquires CPUs from the default pool, by | ||
topological-best-fit. | ||
1. Static policy updates the state, adding an assignment for the new | ||
container and removing those CPUs from the default pool. | ||
1. CPU manager reads container assignment from the state. | ||
1. CPU manager updates the container resources via the CRI. | ||
1. Kuberuntime calls the CRI delegate to start the container. | ||
|
||
1. _A container that was assigned exclusive cores terminates._ | ||
1. Kuberuntime unregisters the container with the CPU manager. | ||
1. CPU manager unregisters the container with the static policy. | ||
Review comment: nit: container |
||
1. Static policy adds the container's assigned CPUs back to the default | ||
pool. | ||
1. Kuberuntime calls the CRI delegate to remove the container. | ||
1. Asynchronously, the CPU manager's reconcile loop updates the | ||
cpuset for all containers running in the shared pool. | ||
|
||
1. _The shared pool becomes empty._ | ||
Review comment: we previously discussed a node condition for cpu pressure. is the taint the replacement?
Review comment: for pre-emption, will the taint be understood?
Review comment: Apologies, lost track of the progress to convert node conditions to taints. Will update.
Review comment: updated |
||
1. The CPU manager adds a taint with effect NoSchedule, NoExecute | ||
that prevents BestEffort and Burstable QoS class pods from | ||
running on the node. | ||
Review comment: is reading the state and updating the container resources via CRI synchronous with "start the container"? If yes, how kuberuntime gets to know when to invoke "start"? |
||
|
||
1. _The shared pool becomes nonempty._ | ||
Review comment: nit: numbering.
Review comment: same as before re: github markdown |
||
1. The CPU manager removes the taint with effect NoSchedule, NoExecute | ||
for BestEffort and Burstable QoS class pods. | ||
|
||
#### Policy 3: "dynamic" cpuset control | ||
|
||
_TODO: Describe the policy._ | ||
Review comment: summarize what was discussed at f2f, and just state that this is planned for prototyping post kube 1.8. |
||
|
||
##### Implementation sketch | ||
|
||
```go | ||
func (p *dynamicPolicy) Start(s State) { | ||
	// TODO | ||
} | ||
|
||
func (p *dynamicPolicy) RegisterContainer(s State, pod *Pod, container *Container, containerID string) error { | ||
	// TODO | ||
	return nil | ||
} | ||
|
||
func (p *dynamicPolicy) UnregisterContainer(s State, containerID string) error { | ||
	// TODO | ||
	return nil | ||
} | ||
``` | ||
|
||
##### Example pod specs and interpretation | ||
|
||
| Pod | Interpretation | | ||
| ------------------------------------------ | ------------------------------ | | ||
| | | | ||
| | | | ||
|
||
## Operations and observability | ||
|
||
* Checkpointing assignments | ||
* The CPU Manager must be able to pick up where it left off in case the | ||
Kubelet restarts for any reason. | ||
* Read effective CPU assignments at runtime for alerting. This could be | ||
Review comment: assignments |
||
satisfied by the checkpointing requirement. | ||
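The proposal does not specify a checkpoint format. Purely as an illustration of the requirement, the sketch below round-trips assignments through JSON; the `checkpoint` struct, its field names, and the use of cpuset list strings are all assumptions rather than part of the design.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// checkpoint is a hypothetical on-disk form of the CPU manager state.
// Cpusets are stored as Linux list-format strings such as "0-1" or "2,3".
type checkpoint struct {
	DefaultCPUSet string            `json:"defaultCpuSet"`
	Entries       map[string]string `json:"entries"` // container ID -> cpuset
}

// save serializes the checkpoint; a real implementation would also need to
// write it atomically to disk.
func save(c checkpoint) ([]byte, error) {
	return json.Marshal(c)
}

// restore parses a previously saved checkpoint.
func restore(data []byte) (checkpoint, error) {
	var c checkpoint
	err := json.Unmarshal(data, &c)
	return c, err
}

func main() {
	before := checkpoint{
		DefaultCPUSet: "0-1",
		Entries:       map[string]string{"container-a": "2-3"},
	}
	data, err := save(before)
	if err != nil {
		panic(err)
	}
	after, err := restore(data)
	if err != nil {
		panic(err)
	}
	fmt.Println(after.DefaultCPUSet, after.Entries["container-a"])
}
```

The same serialized form could, under these assumptions, also serve the observability requirement of reading effective assignments at runtime.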
|
||
## Practical challenges | ||
|
||
1. Synchronizing CPU Manager state with the container runtime via the | ||
CRI. Runc/libcontainer allows container cgroup settings to be updated | ||
after creation, but neither the Kubelet docker shim nor the CRI | ||
implement a similar interface. | ||
1. Mitigation: [PR 46105](https://github.com/kubernetes/kubernetes/pull/46105) | ||
Review comment: this is needed independent of cpu policy.
Review comment: Any update required in this text? (what are the other things that need it, VPA?) |
||
1. Compatibility with the `isolcpus` Linux kernel boot parameter. The operator | ||
may want to correlate exclusive cores with the isolated CPUs, in which | ||
case the static policy outlined above, where allocations are taken | ||
directly from the shared pool, is too simplistic. | ||
1. Mitigation: defer supporting this until a new policy tailored for | ||
use with `isolcpus` can be added. | ||
|
||
## Implementation roadmap | ||
|
||
### Phase 1: No-op policy | ||
Review comment: make clear kube 1.8 targets phase 1 and 2. |
||
|
||
* Internal API exists to allocate CPUs to containers | ||
([PR 46105](https://github.com/kubernetes/kubernetes/pull/46105)) | ||
* Kubelet configuration includes a CPU manager policy (initially only no-op) | ||
* No-op policy is implemented. | ||
* All existing unit and e2e tests pass. | ||
* Initial unit tests pass. | ||
|
||
### Phase 2: Static policy | ||
|
||
* Kubelet can discover "basic" CPU topology (HT-to-physical-core map) | ||
* Static policy is implemented. | ||
* Unit tests for static policy pass. | ||
* e2e tests for static policy pass. | ||
* Performance metrics for one or more plausible synthetic workloads show | ||
benefit over no-op policy. | ||
|
||
### Phase 3: Cache allocation | ||
|
||
* Static policy also manages [cache allocation][cat] on supported platforms. | ||
|
||
### Phase 4: Dynamic policy | ||
|
||
* Dynamic policy is implemented. | ||
* Unit tests for dynamic policy pass. | ||
* e2e tests for dynamic policy pass. | ||
* Performance metrics for one or more plausible synthetic workloads show | ||
benefit over no-op policy. | ||
|
||
### Phase 5: NUMA | ||
|
||
* Kubelet can discover "advanced" CPU topology (NUMA). | ||
|
||
## Appendix A: cpuset pitfalls | ||
|
||
1. [`cpuset.sched_relax_domain_level`][cpuset-files]. "controls the width of | ||
the range of CPUs over which the kernel scheduler performs immediate | ||
rebalancing of runnable tasks across CPUs." | ||
1. Child cpusets must be subsets of their parents. If B is a child of A, | ||
then B must be a subset of A. Attempting to shrink A such that B | ||
would contain allowed CPUs not in A is not allowed (the write will | ||
fail.) Nested cpusets must be shrunk bottom-up. By the same rationale, | ||
nested cpusets must be expanded top-down. | ||
1. Dynamically changing cpusets by writing directly to the cgroup filesystem | ||
would create inconsistencies with container runtimes. | ||
1. The `exclusive` flag. This will not be used. We will achieve | ||
exclusivity for a CPU by removing it from all other assigned cpusets. | ||
1. Tricky semantics when cpusets are combined with CFS shares and quota. | ||
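The nesting rule in item 2 (shrink bottom-up, expand top-down) amounts to ordering writes by cgroup depth. The sketch below illustrates that ordering only; `writeOrder` is a hypothetical helper, the paths are made-up examples, and depth is approximated by counting `/` separators.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// writeOrder returns the cpuset cgroup paths in the order their cpuset.cpus
// files should be written: deepest first when shrinking (so that children
// stay subsets of their parents), shallowest first when expanding.
// The input slice is not modified.
func writeOrder(paths []string, shrinking bool) []string {
	ordered := append([]string(nil), paths...)
	sort.SliceStable(ordered, func(i, j int) bool {
		di := strings.Count(ordered[i], "/")
		dj := strings.Count(ordered[j], "/")
		if shrinking {
			return di > dj
		}
		return di < dj
	})
	return ordered
}

func main() {
	paths := []string{"/kubepods/pod1", "/kubepods", "/kubepods/pod1/c1"}
	fmt.Println(writeOrder(paths, true))  // container, then pod, then root
	fmt.Println(writeOrder(paths, false)) // root, then pod, then container
}
```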
|
||
[cat]: http://www.intel.com/content/www/us/en/communications/cache-monitoring-cache-allocation-technologies.html | ||
[cpuset-files]: http://man7.org/linux/man-pages/man7/cpuset.7.html#FILES | ||
[ht]: http://www.intel.com/content/www/us/en/architecture-and-technology/hyper-threading/hyper-threading-technology.html | ||
[hwloc]: https://www.open-mpi.org/projects/hwloc | ||
[node-allocatable]: https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node-allocatable.md#phase-2---enforce-allocatable-on-pods | ||
[procfs]: http://man7.org/linux/man-pages/man5/proc.5.html | ||
[qos]: https://github.com/kubernetes/community/blob/master/contributors/design-proposals/resource-qos.md | ||
[topo]: http://github.com/intelsdi-x/swan/tree/master/pkg/isolation/topo |
Review comment: add a requirement that a container must be able to discover the set of cpus it has been assigned.

Review comment: I'm not sure I agree that this should be a requirement. I see the value, where people are wanting to do thread affinity inside the pod, but it comes with baggage.
It pins us in when we go to add the dynamic cpuset manager later.
It also creates problem in the kubelet restart case. If our contract is "I'll put you on an exclusive core", we have the freedom to "reschedule" on kubelet restart rather than writing a nasty cgroup parser to reconstitute the cpuset manager state.

Review comment: Reschedule as in possibly shift a container to a new core after it's already been running? That would be rather scary for many workloads\users if that happened.
The cpuset is just a bit flag which is easy to parse?
Could the assigned cores be saved to etcd or a local cache so it's recoverable perhaps?

Review comment: @chino the CPU mask would only be changed after process start if the dynamic CPU Manager policy is enabled. The static policy guarantees never to do that.
@derekwaynecarr and @sjenning: Inside the container, workloads that care about their CPU mask can easily get it by reading `/proc/self/status` (see below). Is that sufficient?

Review comment: For a dynamic policy, we should send a signal. For static, I agree it's not needed.

Review comment: Backfilling from off-list discussion: we talked today about providing a signal in the following way. We could project (a subset of) the CPU manager state into a volume visible to selected containers. User workloads could subscribe to update events in a normal Linux manner (e.g. inotify.) We also decided, more or less, that this won't be necessary until the dynamic policy is implemented. We can document that user containers in the shared pool must avoid setting their own CPU affinity, since it will be overwritten without notice any time the membership of the shared pool changes.

Review comment: catching up on doc, so this may be covered elsewhere, but we should say something about signalling in a future iteration of this proposal.

Review comment: Will add it to the dynamic policy section.

Review comment: added