Propose Memory Manager for NUMA awareness #1203
Conversation
Hi @bg-chun. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test` on its own line. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/ok-to-test
- Allocatable Memory of NUMA node >= Guaranteed memory for a container.
- Node affinity of hugepage can be calculated by below formulas.
- Available huagpages of NUMA node = Total hugepages of NUMA node - reserved by system.
*hugepages
BTW I don't think hugepages can be reserved for the system with the current configuration options. IMO Kubelet treats all hugepages on the node as available
Yes, you are right, the current kubelet doesn't support reserving hugepages for the system. I didn't mention the "update existing hugepages KEP" in this KEP; that's my bad.
It includes "Support reserving huge pages for the system on the Node Allocatable feature" and "Update cAdvisor to discover and store hugepages per NUMA node".
It seems that I can either add an extra description here about the other KEP (updating hugepages) or not mention hugepages "reserved by system" here at all.
But I'm not sure which is the better choice.
IMHO it is not so critical, because once you use the host as a Kubernetes node, you should not have additional applications on top of this host that use huge pages.
> IMHO it is not so critical, because once you use the host as a Kubernetes node, you should not have additional applications on top of this host that use huge pages.

Yes, in a few cases it is not critical.
Unfortunately, as far as I know, most modern NFV clusters have a DPDK-based packet switcher such as OVS-DPDK, VPP, or Tungsten Fabric on their nodes.
The critical issue is that most of them are not well integrated with Kubernetes; they just run on bare metal or on the host (I mean they are not containerized) and consume exclusive CPU cores and hugepages of a specific NUMA node.
This would cause the Memory Manager to generate unreliable topology hints.
I don't want to face that situation, and that was the main reason why I mentioned reserved hugepages here.
Fortunately, we have good news for this issue.
@bg-chun thanks for the information. Regarding huge pages reservation, we will have the same problem that we have with memory: how can you know from which NUMA node the system will use huge pages?
For hugepages, my idea is to add a new flag to reserve hugepages per NUMA node.
I think we can extend this approach (kubernetes/kubernetes#83541) later.
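For illustration only, a rough sketch of how per-NUMA-node reservations could be expressed in the kubelet configuration; the `reservedMemory` field name, its shape, and the sizes below are assumptions made for this sketch, not an agreed design:

```yaml
# Sketch (assumed field names): reserve memory and hugepages per NUMA node so
# that the Memory Manager does not offer them to pods.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
reservedMemory:
  - numaNode: 0
    limits:
      memory: 1Gi          # reserved for system daemons on NUMA node 0
      hugepages-1Gi: 2Gi   # e.g. hugepages consumed by a host-level OVS-DPDK
  - numaNode: 1
    limits:
      memory: 500Mi
```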
Hello, I am really concerned about the memory NUMA awareness feature and have seen your proposal.
@andyzheung In fact, my team already finished a PoC of this feature based on pre-released versions (master, beta, and rc1/rc2) of the 1.16 release with Topology Manager. But we faced some issues: 1) pod isolation of huge pages, 2) the CRI doesn't support huge pages, and 3) cAdvisor doesn't discover pre-allocated huge pages per NUMA node. So I am trying to resolve the above issues to update the huge pages KEP.
This seems to be missing the kep.yaml file. See: https://github.com/kubernetes/enhancements/tree/master/keps/NNNN-kep-template for the template.
@kikisdeliveryservice do you mean either:
Hi, good question! It's number 2. The body of your existing KEP should be in a README.md: https://github.com/kubernetes/enhancements/blob/master/keps/NNNN-kep-template/README.md. The top portion (with authors, approvers, milestones, etc.) should be in kep.yaml (see the kep.yaml in the template). Both the md and yaml files should be in a folder under https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/ titled something like 1769-memory-manager-numa-awareness (your choice, but it should begin with 1769, the KEP issue number). See an example of what you are going for here: https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1967-size-memory-backed-volumes. Hope that helps!
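For reference, a minimal sketch of what such a kep.yaml could look like; the field set follows the NNNN-kep-template, and the values below are illustrative placeholders drawn from this PR rather than the final metadata:

```yaml
# keps/sig-node/1769-memory-manager-numa-awareness/kep.yaml (illustrative sketch)
title: Memory Manager for NUMA awareness
kep-number: 1769
authors:
  - "@bg-chun"
  - "@cezaryzukowski"
  - "@cynepco3hahue"
owning-sig: sig-node
participating-sigs: []
status: implementable
reviewers:
  - "@derekwaynecarr"
approvers:
  - "@derekwaynecarr"
```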
Thank you for the update @cezaryzukowski looks good!! One small comment
- "@cynepco3hahue" | ||
owning-sig: sig-node | ||
participating-sigs: | ||
status: implemented |
This should be `implementable`. A KEP status will only be `implemented` once the feature has graduated to GA. 😄
Many thanks, updated 😄
For instance, consider a two-socket machine with only one size of hugepages pre-allocated. In this case, *Memory Manager* instantiates four memory tables (`[ two NUMA nodes ] * [ two memory types ]`) on start-up.

Notice that on modern architectures, a single socket, i.e., a physical NUMA node, can be further split into a number of logical NUMA nodes, which is done to further improve the performance. For this purpose, Intel offers machines with the Sub-NUMA Clustering (SNC, an upgraded version of Cluster-On-Die) feature, and AMD offers machines with the NUMA Nodes per Socket (NPS) feature. In such a scenario where SNC or NPS is enabled on a machine, a K8S node and the *Memory Manager* can detect and be aware of many NUMA nodes, although the machine features a single socket (a single physical NUMA node).
There is no such thing as a "physical NUMA node". "NUMA node" is an abstract term describing to the OS (e.g. Linux) the way the memory controller is operating. There is no strict rule on how many there are per socket: it can be one per socket, or it can be multiple.
One of the biggest problems with this proposal, besides being limited only to Guaranteed pods, is that the design can't accommodate support for "CPU-less" NUMA nodes, in Linux terms.
@kad Are there actual systems being built (that would be used for kubernetes) which have NUMA nodes with memory and no CPUs? I was under the impression that most modern systems had memory controllers embedded with the CPUs for speed.
I think this is a great start to improve memory alignment.
I would like the KEP to clarify any recommendation we would make around kernel memory accounting, its relationship to this feature, and any guarantees it provides, as operators that consume this feature may find that behavior surprising.
If you can add a section for metrics gathering under the beta sections, it would be extremely helpful to understand as a cluster administrator if the feature is working as expected.
The *Memory Manager* is a new component of the _kubelet_ ecosystem proposed to enable single-NUMA and multi-NUMA guaranteed memory allocation. Besides:
- *Memory Manager* is also a new hint provider for _Topology Manager_;
- The hints are intended to indicate preferred memory (and hugepages) affinity which pins the memory for a container either to a single NUMA node or a group of NUMA nodes;
- By design, *Memory Manager* offers guaranteed memory allocation for pods in Guaranteed QoS class.
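(For context, only pods in the Guaranteed QoS class qualify, i.e. requests must equal limits for every resource. Below is an illustrative sketch of such a pod, with placeholder names, image, and sizes, requesting memory and hugepages that the Memory Manager would pin to NUMA node(s):)

```yaml
# Illustrative Guaranteed QoS pod: requests == limits for every resource,
# including hugepages, so the Memory Manager can pin its memory.
apiVersion: v1
kind: Pod
metadata:
  name: dpdk-example                               # placeholder name
spec:
  containers:
  - name: app
    image: registry.example.com/dpdk-app:latest    # placeholder image
    resources:
      requests:
        cpu: "4"
        memory: 2Gi
        hugepages-1Gi: 2Gi
      limits:
        cpu: "4"
        memory: 2Gi
        hugepages-1Gi: 2Gi
    volumeMounts:
    - name: hugepage
      mountPath: /dev/hugepages
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
```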
I think we need to measure the impact of this feature with kernel memory accounting, or measure how NUMA awareness minimizes the impacts we have seen with kernel memory accounting of memory. We do not offer great community guidance on whether kubelet should mandate `cgroup.memory=nokmem`.
see here:
https://bugzilla.redhat.com/show_bug.cgi?id=1855067#c117
"We have measured that the kernel scheduler moves containers between CPU threads. Per each CPU thread use, approximately 1/2 MiB may be allocated for kernel slabs. If you do not account for this N * 1/2 MiB overhead, where N is the number of CPU threads on the node that the pod is bound against, your container may eventually be killed by the system due to out of memory errors."
related:
opencontainers/runc#1725
section 2.7:
https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt
If someone can weigh in with any behavior difference with cgroup v2, it would be helpful as we look to improve guarantees around memory as these issues cause user surprise when we say "offer guaranteed memory allocation" across node sizes.
/cc @giuseppe
In the most recent issue (reported in 2020) that you mentioned above, `cgroup.memory=nokmem` should be used, as Andrea suggested, in order to disable "kernel memory accounting":
The above was measured before the OOM kill, the dump was taken after, and they
differ by 56kbytes which means all memory was tracked correctly and the bulk of
it is in 32kbyte slabs (because the kernel wasn't booted with slub_max_order=0)
for each CPU (and the only way to disable the per-memcg slab is to disable kmem
by booting the kernel with "cgroup.memory=nokmem").
But the issue related to "kernel memory accounting" popped up before (reported in 2018) and was extensively discussed [here].
In the discussion, this recommendation was provided by @dashpole to describe how to configure K8S when "kernel memory accounting" is disabled:
- Follow this comment to disable kernel memory accounting by recompiling the kernel: Enabling kmem accounting can break applications on CentOS7 opencontainers/runc#1725 (comment).
- I did some testing with kernel memory accounting disabled, and found that it made a relatively small impact on the ability of the kubelet to manage memory. I would recommend increasing --eviction-hard's memory.available parameter by 50Mi when disabling kernel memory accounting.
the way the kubelet manages memory is roughly to compare memory.usage_in_bytes to memory_capacity - eviction_threshold. Because memory.usage_in_bytes will not include kernel memory, the measured usage_in_bytes will be lower than the actual, and we may OOM even when usage_in_bytes is not close to 0. In my testing, I found that there was 30-50 Mi of remaining kernel memory when the node is under memory pressure. One way to compensate for this is to increase the eviction threshold.
@derekwaynecarr Would you expect us to provide similar recommendation for multi-NUMA use case? (as you already mentioned in another comment: "the multi-numa scenario increases the impact of cgroup.kmem behavior when sizing these workloads.")
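(To make the quoted recommendation concrete, here is a sketch of the corresponding kubelet configuration change, assuming the default 100Mi hard-eviction threshold for `memory.available`; the exact number is illustrative:)

```yaml
# Illustrative sketch: add ~50Mi of headroom to the hard eviction threshold
# when kernel memory accounting is disabled (cgroup.memory=nokmem).
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "150Mi"   # default is 100Mi; +50Mi for untracked kernel memory
```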
I see some serious cons to setting `cgroup.memory=nokmem`, from the above bug:
- a malicious container can take over a ton of kernel memory, much more than the memory.max limit, potentially causing a DoS to other containers or system apps
- a malicious container can fragment forever the whole physical address space, rendering all page blocks non-movable, so all other containers will forever run slow even after the malicious container has been destroyed
- for non-malicious containers, there's still a QoS concern, where one container doing lots of file I/O with zillions of tiny files could reduce the cache available to other containers

It is also important to note that kernel 5.9 has changed from having a per-memory-cgroup kmemcache to using a common kmemcache for all non-root memory cgroups. As a result, we should be seeing a 50% or more reduction in kernel slab memory usage in the containers.
So for the near future, I think it will be enough to add a section under container resources regarding the guaranteed pods and kmem consumption.
We can add such a notice to our KEP, but IMHO it should be placed in the k8s documentation. @derekwaynecarr WDYT?
@cynepco3hahue I agree with the above analysis, and a note about the more recent kernels is sufficient.
![alt text](./mm_tm_diagram.png)

Once *kubelet* requests a guaranteed QoS pod admission, as shown in the figure above, *Topology Manager* queries *Memory Manager* about the preferred NUMA affinity for memory and hugepages, for all containers in a pod. For each container in a pod, *Memory Manager* calculates the affinity using its internal database, namely the `Node Map`. The `Node Map` is an object which is responsible for tracking the utilization of memory (and hugepages) for all containers in Guaranteed QoS class.
Once *Memory Manager* completes the computations, it returns the hints to *Topology Manager* so that *Topology Manager* can figure out which NUMA node or group of NUMA nodes is the best fit for memory pinning, for a container. The overall calculation is performed for all containers in the pod, and if none of the containers is rejected, the pod is finally admitted and deployed.
Can we clarify that the per-container or per-pod behavior is based on the specified topology-manager-scope? So the behavior will vary depending on whether pod or container is provided for determining alignment?
Yes, you are right. The behavior of hint generation will be based on the Topology Manager Scope. We will update the KEP to reflect the scope.
In detail:
- for `container` scope, `GetTopologyHints` will be called to get hints for a particular container
- for `pod` scope, `GetPodTopologyHints` will be called to get hints for the entire pod
Updated to:
Depending on Topology Manager scope (`container` or `pod`), distinct hint generation routines are used:
- `GetTopologyHints` for `container` scope
- `GetPodTopologyHints` for `pod` scope
In the admission phase, the *Memory Manager* uses `Allocate()` and updates its `Node Map` to reflect the amount of memory and hugepages requested by a container. After that, *Memory Manager* uses `AddContainer()` and enforces the consumption of memory and hugepages for the container. The enforcement limits the container's memory consumption to the NUMA node or the group of NUMA nodes indicated by *Topology Manager*. In the final step in the figure above, *Memory Manager* updates `cgroups`, namely `cpuset.mems`. The *Memory Manager* will offer guaranteed memory allocation (only) for pods in Guaranteed QoS class, which is clarified in [this](#mechanism-ii-out-of-memory-oom-killer-by-kernelos) section.
I am just noting here, as my own personal reminder, that this should work with the present mapping for either cgroup v1 or v2, since we map either.
- Systems such as real-time trading systems or 5G CNFs (User Plane Function, UPF) employ DPDK to guarantee low-latency packet processing. DPDK (*Data Plane Development Kit*) consists of a set of libraries designed for packet processing acceleration managed in software, in the userspace. DPDK requires dedicated resources (such as exclusive CPUs, hugepages, and a DPDK-compatible Network Interface Card), and most importantly, the alignment of resources to the same NUMA node so as to prevent unexpected performance degradation due to inter-node (between NUMA nodes) communication overhead. This creates the reason to guarantee the reservation of memory (and hugepages) and other computing resources (e.g. CPU cores and NICs) from the same NUMA node, for DPDK-oriented containers. One of the key KPIs in 5G deployment is Ultra-Reliable and Low Latency Communications (URLLC), which is mission-critical. This accounts for why this proposal is also critical.

#### Story 2 : Databases
For KubeVirt, similar use-cases like Databases exist. The VMs which people bring to the cluster require special tuning for various reasons:
- The workload inside the VM requires hugepages, NUMA placement, CPU assignment, ... In principle, similar requirements as for e.g. databases exist. We need to influence, for instance, the NUMA placement, so that we can pass it properly through to the guest, so that the workload inside the guest can be configured the right way.
- The VMs themselves profit a lot from fine-tuning:
- assigning full CPU cores to vCPUs inside the VM to improve performance
- using hugepages as memory backing mechanism to avoid double-memory-management via the guest and the host kernel
- I/O devices passthrough, where it makes sense to run on NUMA nodes nearby the device
- There exist CNFs which require their own kernel modules, which can conflict with the host; VMs can then be used to overcome this. To configure the CNFs properly, we need to influence the NUMA placement for our VMs.
@derekwaynecarr I hope that highlights our need for all this.
- `Story 2 : Databases` was extended by `in-memory databases` (which require to sustain/manage memory across multiple NUMA nodes)
- `Story 3 : KubeVirt (provided by @rmohr)` was added
## Design Details

### How to enable the guaranteed memory allocation over many NUMA nodes?
which user story maps to this use case?
the multi-numa scenario increases the impact of cgroup.kmem behavior when sizing these workloads.
> which user story maps to this use case?

`Story 2 : Databases` was extended by `in-memory databases` (which require to sustain/manage memory across multiple NUMA nodes).
Please correct me, but without the memory manager, container memory is pinned to all available NUMA nodes, so currently we already have the worst possible case. Introducing multi-NUMA allocation under the memory manager should not make it worse, and it was definitely requested by the community.
good point.
- All flags are supported, i.e.: `--memory-manager-policy`, `--reserved-memory`.

#### Phase 2: Beta (target v1.21)
- Add E2E tests.
I think this should move to alpha; otherwise, how do we know it works at all?
Okay. Now:
- `Add E2E tests.` for Alpha
- `Extend E2E test coverage` for Beta
- Alpha-level documentation.
- All flags are supported, i.e.: `--memory-manager-policy`, `--reserved-memory`.

#### Phase 2: Beta (target v1.21)
I think improved instrumentation for this is important.
please add metrics exposed by kubelet to understand state of node memory map.
The topic of metrics is already addressed in this comment.
of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled). Yes, it uses a feature gate.

* **Does enabling the feature change any default behavior?**
Yes, the admission flow changes for a pod in Guaranteed QoS class. With the Memory Manager, Topology Manager also takes into account available memory (and hugepages) to either admit a pod to the node or reject it.
nit: available memory should be changed to allocatable memory.
the feature is not utilization aware.
Thanks. Corrected.
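(For completeness, a sketch of how an operator might opt in, assuming a `MemoryManager` feature gate and a `Static` policy as this KEP proposes; the exact names and defaults may change before release:)

```yaml
# Illustrative opt-in sketch: enable the feature gate and pick a non-default policy.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  MemoryManager: true
memoryManagerPolicy: Static   # the default "None" policy keeps today's behavior (no pinning)
```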
_This section must be completed when targeting beta graduation to a release._

* **How can a rollout fail? Can it impact already running workloads?**
Try to be as paranoid as possible - e.g., what if some components will restart
It is possible during cluster lifecycle actions (upgrade, cordon, drain, admit) that the cluster may not be able to satisfy all previously running pods, depending on the order in which pods are scheduled and admitted to nodes. The feature assumes operators account for this by having workloads fit to the cluster, or sufficient excess capacity to handle it, or some other defragmentation mechanism to try to make an optimal fit.
That's right, the order can change, and pods can be also rescheduled to different hardware.
_This section must be completed when targeting beta graduation to a release._

* **How can an operator determine if the feature is in use by workloads?**
I would really like the kubelet to expose metrics about the state of the topology assignment.
The topic of metrics is already addressed in this comment.
Yes. It is not about whether memory controllers are embedded or not. It is more about how the memory is organised. "NUMA node" is just a number that identifies the region that the memory controller considers as a unit of grouping. As for systems, there are on the market recent Xeons (Skylake and up) that can support Optane DC memory. Once it is used, it will look something like this (:
Similarly, it applies to the several-years-old Xeon Phi, which had "CPU-less NUMA nodes" for MCDRAM (high-bandwidth memory, beneficial for HPC apps). So, in short, the Linux kernel has those entries in /sys/devices/system/node/* for a reason: to make sure that CPU core/memory controller mappings can be identified.
I appreciate the depth of the proposal.
This is a complicated topic, and I look forward to seeing what we learn as it iterates to alpha and beyond.
/approve
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: bg-chun, derekwaynecarr. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Propose Memory Manager for NUMA awareness
Related Issue
Support isolating memory to single NUMA node
Support Container Isolation of Hugepages