Propose Memory Manager for NUMA awareness #1203
Conversation
Hi @bg-chun. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test` on its own line. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/ok-to-test
- Allocatable Memory of NUMA node >= Guaranteed memory for a container.
- Node affinity of hugepage can be calculated by below formulas.
- Available huagpages of NUMA node = Total hugepages of NUMA node - reserved by system.
*hugepages
BTW I don't think hugepages can be reserved for the system with the current configuration options. IMO Kubelet treats all hugepages on the node as available
Yes, you are right, the current kubelet doesn't support reserving hugepages for the system. I didn't mention the "update existing hugepages KEP" in this KEP; that's my bad.
It includes "Support reserving huge pages for the system on the Node Allocatable feature" and "Update cAdvisor to discover and store hugepages per NUMA node".
It seems that I can either add an extra description here about the other KEP (updating hugepages) or not mention hugepages "reserved by system" here at all.
But I'm not sure which is the better choice.
IMHO it is not so critical, because once you use the host as a Kubernetes node, you should not have additional applications on top of this host that use huge pages.
> IMHO it is not so critical, because once you use the host as a Kubernetes node, you should not have additional applications on top of this host that use huge pages.

Yes, in a few cases it is not critical.
Unfortunately, as far as I know, most modern NFV clusters have a DPDK-based packet switcher such as OVS-DPDK, VPP, or Tungsten Fabric on their nodes.
The critical issue is that most of them are not well integrated with Kubernetes; they just run on bare metal or on the host (I mean they are not containerized) and consume exclusive CPU cores and hugepages of a specific NUMA node.
This would cause the Memory Manager to generate unreliable topology hints.
I don't want to face that situation, and that was the main reason why I mentioned reserved hugepages here.
Fortunately, we have good news for this issue.
@bg-chun thanks for the information. Regarding huge pages reservation, we will have the same problem that we have with memory: how can you know from which NUMA node the system will use huge pages?
For hugepages, my idea is to add a new flag to reserve hugepages per NUMA node.
I think we can extend this approach (kubernetes/kubernetes#83541) later.
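For illustration only, a rough sketch of how per-NUMA-node reservations could be expressed in the kubelet configuration; the `reservedMemory` field name, its shape, and the sizes below are assumptions made for this sketch, not an agreed design:

```yaml
# Sketch (assumed field names): reserve memory and hugepages per NUMA node so
# that the Memory Manager does not offer them to pods.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
reservedMemory:
  - numaNode: 0
    limits:
      memory: 1Gi          # reserved for system daemons on NUMA node 0
      hugepages-1Gi: 2Gi   # e.g. hugepages consumed by a host-level OVS-DPDK
  - numaNode: 1
    limits:
      memory: 500Mi
```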
Hello, I am really concerned about the memory NUMA awareness feature and have seen your proposal.
@andyzheung In fact, my team already finished a PoC of this feature based on pre-released versions (master, beta, and rc1/rc2) of the 1.16 release with Topology Manager. But we faced some issues: 1) pod isolation of huge pages, 2) the CRI doesn't support huge pages, and 3) cAdvisor doesn't discover pre-allocated huge pages per NUMA node. So I am trying to resolve the above issues to update the huge pages KEP.
This seems to be missing the kep.yaml file. See: https://github.com/kubernetes/enhancements/tree/master/keps/NNNN-kep-template for the template.
@kikisdeliveryservice do you mean either:
Hi, good question! It's number 2. The body of your existing KEP should be in a README.md: https://github.com/kubernetes/enhancements/blob/master/keps/NNNN-kep-template/README.md. The top portion (with authors, approvers, milestones, etc.) should be in kep.yaml (see the kep.yaml in the template). Both the md and yaml files should be in a folder under https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/ titled something like 1769-memory-manager-numa-awareness (your choice, but it should begin with 1769, the KEP issue number). See an example of what you are going for here: https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1967-size-memory-backed-volumes. Hope that helps!
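For reference, a minimal sketch of what such a kep.yaml could look like; the field set follows the NNNN-kep-template, and the values below are illustrative placeholders drawn from this PR rather than the final metadata:

```yaml
# keps/sig-node/1769-memory-manager-numa-awareness/kep.yaml (illustrative sketch)
title: Memory Manager for NUMA awareness
kep-number: 1769
authors:
  - "@bg-chun"
  - "@cezaryzukowski"
  - "@cynepco3hahue"
owning-sig: sig-node
participating-sigs: []
status: implementable
reviewers:
  - "@derekwaynecarr"
approvers:
  - "@derekwaynecarr"
```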
Thank you for the update @cezaryzukowski looks good!! One small comment
- "@cynepco3hahue" | ||
owning-sig: sig-node | ||
participating-sigs: | ||
status: implemented |
This should be `implementable`. A KEP status will only be `implemented` once the feature has graduated to GA. 😄
Many thanks, updated 😄
For instance, consider a two-socket machine with only one size of hugepages pre-allocated. In this case, *Memory Manager* instantiates four memory tables (`[ two NUMA nodes ] * [ two memory types ]`) on start-up.

Notice that on modern architectures, a single socket, i.e., a physical NUMA node, can be further split into a number of logical NUMA nodes, which is done to further improve the performance. For this purpose, Intel offers machines with the Sub-NUMA Clustering (SNC, an upgraded version of Cluster-On-Die) feature, and AMD offers machines with the NUMA Nodes per Socket (NPS) feature. In such a scenario where SNC or NPS is enabled on a machine, a K8S node and the *Memory Manager* can detect and be aware of many NUMA nodes, although the machine features a single socket (a single physical NUMA node).
There is no such thing as a "physical NUMA node". "NUMA node" is an abstract term describing to the OS (e.g. Linux) the way the memory controller is operating. There is no strict rule on how many there are per socket: it can be one per socket, or it can be multiple.
One of the biggest problems with this proposal, besides being limited only to Guaranteed pods, is that the design can't accommodate support for "CPU-less" NUMA nodes, in Linux terms.
@kad Are there actual systems being built (that would be used for kubernetes) which have NUMA nodes with memory and no CPUs? I was under the impression that most modern systems had memory controllers embedded with the CPUs for speed.
I think this is a great start to improve memory alignment.
I would like the KEP to clarify any recommendation we would make around kernel memory accounting, its relationship to this feature, and any guarantees it provides, as operators that consume this feature may find that behavior surprising.
If you can add a section for metrics gathering under the beta sections, it would be extremely helpful to understand as a cluster administrator if the feature is working as expected.
The *Memory Manager* is a new component of the _kubelet_ ecosystem proposed to enable single-NUMA and multi-NUMA guaranteed memory allocation. Besides:
- *Memory Manager* is also a new hint provider for _Topology Manager_;
- The hints are intended to indicate preferred memory (and hugepages) affinity which pins the memory for a container either to a single NUMA node or a group of NUMA nodes;
- By design, *Memory Manager* offers guaranteed memory allocation for pods in Guaranteed QoS class.
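(For context, only pods in the Guaranteed QoS class qualify, i.e. requests must equal limits for every resource. Below is an illustrative sketch of such a pod, with placeholder names, image, and sizes, requesting memory and hugepages that the Memory Manager would pin to NUMA node(s):)

```yaml
# Illustrative Guaranteed QoS pod: requests == limits for every resource,
# including hugepages, so the Memory Manager can pin its memory.
apiVersion: v1
kind: Pod
metadata:
  name: dpdk-example                               # placeholder name
spec:
  containers:
  - name: app
    image: registry.example.com/dpdk-app:latest    # placeholder image
    resources:
      requests:
        cpu: "4"
        memory: 2Gi
        hugepages-1Gi: 2Gi
      limits:
        cpu: "4"
        memory: 2Gi
        hugepages-1Gi: 2Gi
    volumeMounts:
    - name: hugepage
      mountPath: /dev/hugepages
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
```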
I think we need to measure the impact of this feature with kernel memory accounting, or measure how NUMA awareness minimizes the impacts we have seen with kernel memory accounting of memory. We do not offer great community guidance on whether kubelet should mandate `cgroup.memory=nokmem`.
see here:
https://bugzilla.redhat.com/show_bug.cgi?id=1855067#c117
"We have measured that the kernel scheduler moves containers between CPU threads. Per each CPU thread use, approximately 1/2 MiB may be allocated for kernel slabs. If you do not account for this N * 1/2 MiB overhead, where N is the number of CPU threads on the node that the pod is bound against, your container may eventually be killed by the system due to out of memory errors."
related:
opencontainers/runc#1725
section 2.7:
https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt
If someone can weigh in with any behavior difference with cgroup v2, it would be helpful as we look to improve guarantees around memory as these issues cause user surprise when we say "offer guaranteed memory allocation" across node sizes.
/cc @giuseppe
In the most recent issue (reported in 2020) that you mentioned above, `cgroup.memory=nokmem` should be used, as Andrea suggested, in order to disable "kernel memory accounting":
The above was measured before the OOM kill, the dump was taken after, and they
differ by 56kbytes which means all memory was tracked correctly and the bulk of
it is in 32kbyte slabs (because the kernel wasn't booted with slub_max_order=0)
for each CPU (and the only way to disable the per-memcg slab is to disable kmem
by booting the kernel with "cgroup.memory=nokmem").
But the issue related to "kernel memory accounting" popped up before (reported in 2018) and was extensively discussed [here].
In the discussion, this recommendation was provided by @dashpole to describe how to configure K8S when "kernel memory accounting" is disabled:
- Follow this comment to disable kernel memory accounting by recompiling the kernel: Enabling kmem accounting can break applications on CentOS7 opencontainers/runc#1725 (comment).
- I did some testing with kernel memory accounting disabled, and found that it made a relatively small impact on the ability of the kubelet to manage memory. I would recommend increasing --eviction-hard's memory.available parameter by 50Mi when disabling kernel memory accounting.
the way the kubelet manages memory is roughly to compare memory.usage_in_bytes to memory_capacity - eviction_threshold. Because memory.usage_in_bytes will not include kernel memory, the measured usage_in_bytes will be lower than the actual, and we may OOM even when usage_in_bytes is not close to 0. In my testing, I found that there was 30-50 Mi of remaining kernel memory when the node is under memory pressure. One way to compensate for this is to increase the eviction threshold.
@derekwaynecarr Would you expect us to provide similar recommendation for multi-NUMA use case? (as you already mentioned in another comment: "the multi-numa scenario increases the impact of cgroup.kmem behavior when sizing these workloads.")
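(To make the quoted recommendation concrete, here is a sketch of the corresponding kubelet configuration change, assuming the default 100Mi hard-eviction threshold for `memory.available`; the exact number is illustrative:)

```yaml
# Illustrative sketch: add ~50Mi of headroom to the hard eviction threshold
# when kernel memory accounting is disabled (cgroup.memory=nokmem).
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "150Mi"   # default is 100Mi; +50Mi for untracked kernel memory
```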
I see some serious cons to setting `cgroup.memory=nokmem`, from the above bug:
- a malicious container can take over a ton of kernel memory, much more than the memory.max limit, potentially causing a DoS to other containers or system apps
- a malicious container can fragment forever the whole physical address space, rendering all page blocks non-movable, so all other containers will forever run slow even after the malicious container has been destroyed
- for non-malicious containers, there's still a QoS concern, where one container doing lots of file I/O with zillions of tiny files could reduce the cache available to other containers

It is also important to note that kernel 5.9 has changed from having a per-memory-cgroup kmemcache to using a common kmemcache for all non-root memory cgroups. As a result, we should be seeing a 50% or more reduction in kernel slab memory usage in the containers.
So for the near future, I think it will be enough to add a section under container resources regarding the guaranteed pods and kmem consumption.
We can add such a notice to our KEP, but IMHO it should be placed in the k8s documentation. @derekwaynecarr WDYT?
@cynepco3hahue I agree with the above analysis, and a note about the more recent kernels is sufficient.
![alt text](./mm_tm_diagram.png)

Once *kubelet* requests a guaranteed QoS pod admission, as shown in the figure above, *Topology Manager* queries *Memory Manager* about the preferred NUMA affinity for memory and hugepages, for all containers in a pod. For each container in a pod, *Memory Manager* calculates the affinity using its internal database, namely the `Node Map`. The `Node Map` is an object which is responsible for tracking the utilization of memory (and hugepages) for all containers in Guaranteed QoS class.
Once *Memory Manager* completes the computations, it returns the hints to *Topology Manager* so that *Topology Manager* can figure out which NUMA node or group of NUMA nodes is the best fit for memory pinning, for a container. The overall calculation is performed for all containers in the pod, and if none of the containers is rejected, the pod is finally admitted and deployed.
Can we clarify that the per-container or per-pod behavior is based on the specified topology-manager-scope? So the behavior will vary depending on whether pod or container is provided for determining alignment?
Yes, you are right. The behavior of hint generation will be based on the Topology Manager Scope. We will update the KEP to reflect the scope.
In detail:
- for `container` scope, `GetTopologyHints` will be called to get hints for a particular container
- for `pod` scope, `GetPodTopologyHints` will be called to get hints for the entire pod
Updated to:
Depending on Topology Manager scope (`container` or `pod`), distinct hint generation routines are used:
- `GetTopologyHints` for `container` scope
- `GetPodTopologyHints` for `pod` scope
In the admission phase, the *Memory Manager* uses `Allocate()` and updates its `Node Map` to reflect the amount of memory and hugepages requested by a container. After that, *Memory Manager* uses `AddContainer()` and enforces the consumption of memory and hugepages for the container. The enforcement limits the container's memory consumption to the NUMA node or the group of NUMA nodes indicated by *Topology Manager*. In the final step in the figure above, *Memory Manager* updates `cgroups`, namely `cpuset.mems`. The *Memory Manager* will offer guaranteed memory allocation (only) for pods in Guaranteed QoS class, which is clarified in [this](#mechanism-ii-out-of-memory-oom-killer-by-kernelos) section.
I am just noting here, as my own personal reminder, that this should work with the present mapping for either cgroup v1 or v2, since we map either.
- Systems such as real-time trading systems or 5G CNFs (User Plane Function, UPF) employ DPDK to guarantee low-latency packet processing. DPDK (*Data Plane Development Kit*) consists of a set of libraries designed for packet processing acceleration managed in software, in the userspace. DPDK requires dedicated resources (such as exclusive CPUs, hugepages, and a DPDK-compatible Network Interface Card), and most importantly, the alignment of resources to the same NUMA node so as to prevent unexpected performance degradation due to inter-node (between NUMA nodes) communication overhead. This creates the reason to guarantee the reservation of memory (and hugepages) and other computing resources (e.g. CPU cores and NICs) from the same NUMA node, for DPDK-oriented containers. One of the key KPIs in 5G deployment is Ultra-Reliable and Low Latency Communications (URLLC), which is mission-critical. This accounts for why this proposal is also critical.

#### Story 2 : Databases
For KubeVirt, similar use-cases like Databases exist. The VMs which people bring to the cluster require special tuning for various reasons:
- The workload inside the VM requires hugepages, NUMA placement, CPU assignment, ... In principle, similar requirements as for e.g. databases exist. We need to influence, for instance, the NUMA placement, so that we can pass it properly through to the guest, so that the workload inside the guest can be configured the right way.
- The VMs themselves profit a lot from fine-tuning:
- assigning full CPU cores to vCPUs inside the VM to improve performance
- using hugepages as memory backing mechanism to avoid double-memory-management via the guest and the host kernel
- I/O devices passthrough, where it makes sense to run on NUMA nodes nearby the device
- There exist CNFs which require their own kernel modules, which can conflict with the host; VMs can then be used to overcome this. To configure the CNFs properly, we need to influence the NUMA placement for our VMs.
@derekwaynecarr I hope that highlights our need for all this.
- `Story 2 : Databases` was extended by `in-memory databases` (which require to sustain/manage memory across multiple NUMA nodes)
- `Story 3 : KubeVirt (provided by @rmohr)` was added
## Design Details

### How to enable the guaranteed memory allocation over many NUMA nodes?
which user story maps to this use case?
the multi-numa scenario increases the impact of cgroup.kmem behavior when sizing these workloads.
> which user story maps to this use case?

`Story 2 : Databases` was extended by `in-memory databases` (which require to sustain/manage memory across multiple NUMA nodes).
Please correct me, but without the memory manager, container memory is pinned to all available NUMA nodes, so currently we already have the worst possible case. Introducing multi-NUMA allocation under the memory manager should not make it worse, and it was definitely requested by the community.
good point.
- All flags are supported, i.e.: `--memory-manager-policy`, `--reserved-memory`.

#### Phase 2: Beta (target v1.21)
- Add E2E tests.
I think this should move to alpha; otherwise, how do we know it works at all?
Okay. Now:
- `Add E2E tests.` for Alpha
- `Extend E2E test coverage` for Beta
- Alpha-level documentation.
- All flags are supported, i.e.: `--memory-manager-policy`, `--reserved-memory`.

#### Phase 2: Beta (target v1.21)
I think improved instrumentation for this is important.
please add metrics exposed by kubelet to understand state of node memory map.
The topic of metrics is already addressed in this comment.
of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled). Yes, it uses a feature gate.

* **Does enabling the feature change any default behavior?**
Yes, the admission flow changes for a pod in Guaranteed QoS class. With the Memory Manager, Topology Manager also takes into account available memory (and hugepages) to either admit a pod to the node or reject it.
nit: available memory should be changed to allocatable memory.
the feature is not utilization aware.
Thanks. Corrected.
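(For completeness, a sketch of how an operator might opt in, assuming a `MemoryManager` feature gate and a `Static` policy as this KEP proposes; the exact names and defaults may change before release:)

```yaml
# Illustrative opt-in sketch: enable the feature gate and pick a non-default policy.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  MemoryManager: true
memoryManagerPolicy: Static   # the default "None" policy keeps today's behavior (no pinning)
```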
_This section must be completed when targeting beta graduation to a release._

* **How can a rollout fail? Can it impact already running workloads?**
Try to be as paranoid as possible - e.g., what if some components will restart
It is possible during cluster lifecycle actions (upgrade, cordon, drain, admit) that the cluster may not be able to satisfy all previously running pods, depending on the order in which pods are scheduled and admitted to nodes. The feature assumes operators account for this by having workloads fit to the cluster, or sufficient excess capacity to handle it, or some other defragmentation mechanism to try to make an optimal fit.
That's right, the order can change, and pods can be also rescheduled to different hardware.
_This section must be completed when targeting beta graduation to a release._

* **How can an operator determine if the feature is in use by workloads?**
I would really like the kubelet to expose metrics about the state of the topology assignment.
The topic of metrics is already addressed in this comment.
Yes. It is not about whether memory controllers are embedded or not. It is more about how the memory is organised. "NUMA node" is just a number that identifies the region that the memory controller considers as a unit of grouping. As for systems, there are on the market recent Xeons (Skylake and up) that can support Optane DC memory. Once it is used, it will look something like this (:
Similarly, it applies to the several-years-old Xeon Phi, which had "CPU-less NUMA nodes" for MCDRAM (high-bandwidth memory, beneficial for HPC apps). So, in short, the Linux kernel has those entries in /sys/devices/system/node/* for a reason: to make sure that CPU core/memory controller mappings can be identified.
I appreciate the depth of the proposal.
This is a complicated topic, and I look forward to seeing what we learn as it iterates to alpha and beyond.
/approve
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: bg-chun, derekwaynecarr. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Propose Memory Manager for NUMA awareness
Related Issue
Support isolating memory to single NUMA node
Support Container Isolation of Hugepages