# node: start moving the resource management docs to concepts #48797

**Open** · wants to merge 1 commit into base `dev-1.32`
228 changes: 226 additions & 2 deletions in `content/en/docs/concepts/policy/node-resource-managers.md`
@@ -13,10 +13,234 @@ In order to support latency-critical and high-throughput workloads, Kubernetes offers a suite of Resource Managers.

<!-- body -->

The main manager, the Topology Manager, is a Kubelet component that co-ordinates the overall resource management process through its [policy](/docs/tasks/administer-cluster/topology-manager/).
## Hardware Topology Alignment policies
> **Review comment (Contributor):** (extra optional change)
>
> ```suggestion
> ## Hardware topology alignment policies
> ```

_Topology Manager_ is a kubelet component that aims to coordinate the set of components that are
responsible for these optimizations. The overall resource management process is governed using
its [policy](/docs/tasks/administer-cluster/topology-manager/).

> **Review comment (Contributor, on lines +18 to +20):** Extra optional change
>
> ```suggestion
> _Topology Manager_ is a kubelet component that aims to coordinate the set of components that are
> responsible for these optimizations. The overall resource management process is governed using
> the policy you specify. To learn more, read
> [Control Topology Management Policies on a Node](/docs/tasks/administer-cluster/topology-manager/).
> ```
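
For a concrete sense of how this policy is selected, here is a minimal sketch of the relevant
kubelet configuration (the `topologyManagerPolicy` and `topologyManagerScope` fields belong to
the kubelet's `KubeletConfiguration` API; the values shown are one possible choice, not a
recommendation):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Only admit a pod if all of its requested resources can be aligned on a
# single NUMA node (other accepted values: none, best-effort, restricted).
topologyManagerPolicy: single-numa-node
# Align resources for the pod as a whole rather than container by container.
topologyManagerScope: pod
```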


## CPU Management Policies
> **Review comment (Contributor):** I recommend a tweak on this one:
>
> ```suggestion
> ## Policies for assigning CPUs to Pods
> ```


{{< feature-state for_k8s_version="v1.26" state="stable" >}}

> **Review comment (Contributor):** Optional extra change
>
> ```suggestion
> Once a Pod is bound to a Node, the kubelet on that node may need to either multiplex the existing
> hardware (for example, sharing CPUs across multiple Pods) or allocate hardware by dedicating some
> resource (for example, assigning one or more CPUs for a Pod's exclusive use).
> ```

By default, the kubelet uses [CFS quota](https://en.wikipedia.org/wiki/Completely_Fair_Scheduler)
to enforce pod CPU limits. When the node runs many CPU-bound pods, the workload can move to
different CPU cores depending on whether the pod is throttled and which CPU cores are
available at scheduling time. Many workloads are not sensitive to this migration and thus
work fine without any intervention.

However, in workloads where CPU cache affinity and scheduling latency significantly affect
performance, the kubelet allows alternative CPU management policies to determine some placement
preferences on the node. This is implemented using the _CPU Manager_ and its policy.
There are two available policies:

- `none`: the `none` policy explicitly enables the existing default CPU
  affinity scheme, providing no affinity beyond what the OS scheduler does
  automatically. Limits on CPU usage for
  [Guaranteed pods](/docs/tasks/configure-pod-container/quality-service-pod/) and
  [Burstable pods](/docs/tasks/configure-pod-container/quality-service-pod/)
  are enforced using CFS quota.
- `static`: the `static` policy allows containers in `Guaranteed` pods with integer CPU
  `requests` access to exclusive CPUs on the node. This exclusivity is enforced
  using the [cpuset cgroup controller](https://www.kernel.org/doc/Documentation/cgroup-v2.txt).

> **Review comment (Contributor, on lines +39 to +40):** Optional extra change: link to
> https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/ instead.
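
For illustration, a minimal sketch of opting in to the `static` policy through the kubelet
configuration (`cpuManagerPolicy` is a `KubeletConfiguration` field; note that changing the
policy on a node that previously ran with a different one also requires resetting the CPU
manager state file, which this sketch glosses over):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Enable exclusive CPU assignment for eligible Guaranteed pods;
# the default value is "none".
cpuManagerPolicy: static
```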

{{< note >}}
System services such as the container runtime and the kubelet itself can continue to run on these exclusive CPUs.  The exclusivity only extends to other pods.
{{< /note >}}

{{< note >}}
CPU Manager doesn't support offlining and onlining of CPUs at runtime.
{{< /note >}}
> **Review comment (Contributor, on lines +50 to +52):** Optional extra change
>
> ```suggestion
> CPU Manager doesn't support offlining and onlining of CPUs at runtime.
> ```


### Static policy

The static policy enables finer-grained CPU management and exclusive CPU assignment.
This policy manages a shared pool of CPUs that initially contains all CPUs in the
node. The number of exclusively allocatable CPUs is equal to the total
number of CPUs in the node minus any CPU reservations set by the kubelet configuration.
CPUs reserved by these options are taken, in integer quantity, from the initial shared pool
in ascending order by physical core ID. This shared pool is the set of CPUs on which any
containers in `BestEffort` and `Burstable` pods run. Containers in `Guaranteed` pods with
fractional CPU `requests` also run on CPUs in the shared pool. Only containers that are
both part of a `Guaranteed` pod and have integer CPU `requests` are assigned
exclusive CPUs.

{{< note >}}
The kubelet requires a CPU reservation greater than zero when the static policy is enabled.
This is because zero CPU reservation would allow the shared pool to become empty.
{{< /note >}}
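
One way to satisfy that requirement is sketched below (the CPU IDs are arbitrary, and
`reservedSystemCPUs` is only one of the reservation fields in `KubeletConfiguration`,
alongside `systemReserved` and `kubeReserved`):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
# Reserve CPUs 0 and 1 for system and Kubernetes daemons; they are removed
# from the pool of exclusively allocatable CPUs, so the shared pool that
# BestEffort and Burstable pods run on can never become empty.
reservedSystemCPUs: "0,1"
```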

As `Guaranteed` pods whose containers fit the requirements for being statically
assigned are scheduled to the node, CPUs are removed from the shared pool and
placed in the cpuset for the container. CFS quota is not used to bound
the CPU usage of these containers as their usage is bound by the scheduling domain
itself. In other words, the number of CPUs in the container cpuset is equal to the integer
CPU `limit` specified in the pod spec. This static assignment increases CPU
affinity and decreases context switches due to throttling for the CPU-bound
workload.

Consider the containers in the following pod specs:

```yaml
spec:
  containers:
  - name: nginx
    image: nginx
```

The pod above runs in the `BestEffort` QoS class because no resource `requests` or
`limits` are specified. It runs in the shared pool.

```yaml
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
      requests:
        memory: "100Mi"
```

The pod above runs in the `Burstable` QoS class because resource `requests` do not
equal `limits` and the `cpu` quantity is not specified. It runs in the shared
pool.

```yaml
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
        cpu: "2"
      requests:
        memory: "100Mi"
        cpu: "1"
```

The pod above runs in the `Burstable` QoS class because resource `requests` do not
equal `limits`. It runs in the shared pool.

```yaml
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
        cpu: "2"
      requests:
        memory: "200Mi"
        cpu: "2"
```

The pod above runs in the `Guaranteed` QoS class because `requests` are equal to `limits`.
And the container's resource limit for the CPU resource is an integer greater than
or equal to one. The `nginx` container is granted 2 exclusive CPUs.


```yaml
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
        cpu: "1.5"
      requests:
        memory: "200Mi"
        cpu: "1.5"
```

The pod above runs in the `Guaranteed` QoS class because `requests` are equal to `limits`.
But the container's resource limit for the CPU resource is a fraction. It runs in
the shared pool.


```yaml
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
        cpu: "2"
```

The pod above runs in the `Guaranteed` QoS class because only `limits` are specified
and `requests` are set equal to `limits` when not explicitly specified. And the
container's resource limit for the CPU resource is an integer greater than or
equal to one. The `nginx` container is granted 2 exclusive CPUs.

#### Static policy options
> **Review comment (Contributor):** Optional extra change
>
> ```suggestion
> #### Static policy options {#cpu-policy-static--options}
> ```
>
> (sic)


The behavior of the static policy can be fine-tuned using the CPU Manager policy options.
The following policy options exist for the static `CPUManager` policy.
> **Review comment (@sftim, Nov 23, 2024):** Optional extra change
>
> ```suggestion
> The behavior of the static policy can be fine-tuned using CPU manager policy options.
> The following policy options exist for the static CPU management policy:
> {{/* options in alphabetical order */}}
>
> `align-by-socket` (alpha, hidden by default)
> : Align CPUs by physical package / socket boundary, rather than logical NUMA boundaries (available since Kubernetes v1.25)
>
> `distribute-cpus-across-cores` (alpha, hidden by default)
> : Allocate virtual cores, sometimes called hardware threads, across different physical cores (1.31 or higher)
>
> `distribute-cpus-across-numa` (alpha, hidden by default)
> : Spread CPUs across different NUMA domains, aiming for an even balance between the selected domains (available since Kubernetes v1.23)
>
> `full-pcpus-only` (beta, visible by default)
> : Always allocate full physical cores (available since Kubernetes v1.22)
>
> You can toggle groups of options on and off based upon their maturity level
> using the following feature gates:
>
> * `CPUManagerPolicyBetaOptions` (default enabled). Disable to hide beta-level options.
> * `CPUManagerPolicyAlphaOptions` (default disabled). Enable to show alpha-level options.
>
> You will still have to enable each option using the `cpuManagerPolicyOptions` field in the
> kubelet configuration file.
>
> For more detail about the individual options you can configure, read on.
> ```
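
For illustration, a sketch of a kubelet configuration that enables one beta-level and one
alpha-level option together (the field and gate names are real; the particular combination
is just an example):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  # Alpha-level options stay hidden unless this gate is enabled;
  # beta-level options are visible by default.
  CPUManagerPolicyAlphaOptions: true
cpuManagerPolicy: static
cpuManagerPolicyOptions:
  full-pcpus-only: "true"             # beta
  distribute-cpus-across-numa: "true" # alpha, needs the gate above
```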


##### full-pcpus-only
> **Review comment (Contributor):** Optional extra change
>
> ```suggestion
> ##### `full-pcpus-only`
> ```

If the `full-pcpus-only` policy option is specified, the static policy will always allocate full physical cores.
By default, without this option, the static policy allocates CPUs using a topology-aware best-fit allocation.
On SMT enabled systems, the policy can allocate individual virtual cores, which correspond to hardware threads.
This can lead to different containers sharing the same physical cores; this behaviour in turn contributes
to the [noisy neighbours problem](https://en.wikipedia.org/wiki/Cloud_computing_issues#Performance_interference_and_noisy_neighbors).
With the option enabled, the pod will be admitted by the kubelet only if the CPU request of all its containers
can be fulfilled by allocating full physical cores.
If the pod does not pass admission, it will be placed in the `Failed` state with the message `SMTAlignmentError`.
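
As a hedged illustration, assume a node with two hardware threads per physical core and
`full-pcpus-only` enabled; the hypothetical pod below requests an odd number of exclusive
CPUs, so no set of whole physical cores can satisfy it and admission fails:

```yaml
# Hypothetical example: 3 exclusive CPUs cannot be assembled from full
# physical cores when each core carries 2 hardware threads, so the kubelet
# rejects this pod with SMTAlignmentError.
spec:
  containers:
  - name: app
    image: registry.example/app:v1
    resources:
      limits:
        memory: "200Mi"
        cpu: "3"
      requests:
        memory: "200Mi"
        cpu: "3"
```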

##### distribute-cpus-across-numa
> **Review comment (Contributor):** Optional extra change
>
> ```suggestion
> ##### `distribute-cpus-across-numa`
> ```


If the `distribute-cpus-across-numa` policy option is specified, the static
policy will evenly distribute CPUs across NUMA nodes in cases where more than
one NUMA node is required to satisfy the allocation.
By default, the `CPUManager` will pack CPUs onto one NUMA node until it is
filled, with any remaining CPUs simply spilling over to the next NUMA node.
This can cause undesired bottlenecks in parallel code relying on barriers (and
similar synchronization primitives), as this type of code tends to run only as
fast as its slowest worker (which is slowed down by the fact that fewer CPUs
are available on at least one NUMA node).
For example, on a node with two NUMA nodes of 8 CPUs each, a container requesting
12 CPUs would by default receive 8 CPUs from the first NUMA node and 4 from the
second; with this option enabled, it receives 6 from each.
By distributing CPUs evenly across NUMA nodes, application developers can more
easily ensure that no single worker suffers from NUMA effects more than any
other, improving the overall performance of these types of applications.

##### align-by-socket
> **Review comment (Contributor):** Optional extra change
>
> ```suggestion
> ##### `align-by-socket`
> ```


If the `align-by-socket` policy option is specified, CPUs will be considered
aligned at the socket boundary when deciding how to allocate CPUs to a
container. By default, the `CPUManager` aligns CPU allocations at the NUMA
boundary, which could result in performance degradation if CPUs need to be
pulled from more than one NUMA node to satisfy the allocation. Although it
tries to ensure that all CPUs are allocated from the _minimum_ number of NUMA
nodes, there is no guarantee that those NUMA nodes will be on the same socket.
By directing the `CPUManager` to explicitly align CPUs at the socket boundary
rather than the NUMA boundary, we are able to avoid such issues. Note, this
policy option is not compatible with `TopologyManager` `single-numa-node`
policy and does not apply to hardware where the number of sockets is greater
than number of NUMA nodes.

##### distribute-cpus-across-cores
> **Review comment (Contributor):** Optional extra change
>
> ```suggestion
> ##### `distribute-cpus-across-cores`
> ```


If the `distribute-cpus-across-cores` policy option is specified, the static policy
will attempt to allocate virtual cores (hardware threads) across different physical cores.
By default, the `CPUManager` tends to pack CPUs onto as few physical cores as possible,
which can lead to contention among CPUs on the same physical core and result
in performance bottlenecks. By enabling the `distribute-cpus-across-cores` policy,
the static policy ensures that CPUs are distributed across as many physical cores
as possible, reducing the contention on the same physical core and thereby
improving overall performance. For example, on a node with two hardware threads per
physical core, a container requesting 4 CPUs would by default receive both threads of
two physical cores; with this option enabled, it would instead receive one thread on
each of four physical cores. However, it is important to note that this strategy
might be less effective when the system is heavily loaded. Under such conditions,
the benefit of reducing contention diminishes. Conversely, the default behavior
can help in reducing inter-core communication overhead, potentially providing
better performance under high load conditions.

## Other resource managers

The configuration of individual managers is elaborated in dedicated documents:

- [CPU Manager Policies](/docs/tasks/administer-cluster/cpu-management-policies/)
- [Device Manager](/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#device-plugin-integration-with-the-topology-manager)
- [Memory Manager Policies](/docs/tasks/administer-cluster/memory-manager/)