Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Propose Memory Manager for NUMA awareness #1203

Merged
merged 3 commits into from
Oct 2, 2020

Conversation

bg-chun
Copy link
Member

@bg-chun bg-chun commented Aug 6, 2019

Propose Memory Manager for NUMA awareness

  • The goal of this feature is guaranting alignment of resources which include memory and hugepages through Topology Manager.
  • KEP contains Overview, Motivation, Proposal and Graduation Criteria

Related Issue

Support isolating memory to single NUMA node
Support Container Isolation of Hugepages

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 6, 2019
@k8s-ci-robot
Copy link
Contributor

Hi @bg-chun. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Aug 6, 2019
@bg-chun bg-chun changed the title Memory Manager KEP Proposal Propose Memory Manager for NUMA awareness Aug 6, 2019
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. labels Aug 6, 2019
@ConnorDoyle
Copy link
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 7, 2019
keps/sig-node/20190806-memory-manager.md Outdated Show resolved Hide resolved
keps/sig-node/20190806-memory-manager.md Outdated Show resolved Hide resolved
keps/sig-node/20190806-memory-manager.md Outdated Show resolved Hide resolved
- Allocatable Memory of NUMA node >= Guaranteed memory for a container.

- Node affinity of hugepage can be calculated by below formulas.
- Available huagpages of NUMA node = Total hugepages of NUMA node - reserved by system.
Copy link

@Levovar Levovar Aug 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*hugepages

BTW I don't think hugepages can be reserved for the system with the current configuration options. IMO Kubelet treats all hugepages on the node as available

Copy link
Member Author

@bg-chun bg-chun Aug 8, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right, current kubelet doesnt support to reserve huagpages for system. I dont mention about "update existing hugepages kep" in this KEP, It's my bad.
It includes "Support reserve huge pages for system on Node Allocatable feature" and "Update cAdviser to discover and store hugepages per NUMA node"

It seems that I can put extra description here for other KEP(updating hugepage) here or do not mention about hugepages that "resered by system" here.
But, I'm not sure which way is a better choice.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO it not so critical, because once you use the host as Kubernetes node, you should not have an additional application on top of this host that uses huge pages.

Copy link
Member Author

@bg-chun bg-chun Dec 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO it not so critical, because once you use the host as Kubernetes node, you should not have an additional application on top of this host that uses huge pages.

Yes, in few cases, it is not critical.
Unfortunately ,as I know most modern NFV cluster have DPDK based packet switcher such as OVS-DPDK, VPP or tungsten-fabric on their nodes.

The critical question is most of them are not integrated to kubernetes well, they are just running on bare-metal or host(i mean they are not containerized) and consuming exclusive cpu cores and hugepages of specific NUMA node.

and it will makes the issue which is generating un-reliable topolgy hint from memory manager.
I don't want to face that situation and it was the main reason why I mentioned reserved hugepages here.

Fortunately, We have good news for this issue.

  1. we will have --reserved-cpus option for kubelet in 1.17. //thanks @jiayingz
  2. we will have reserve flag for hugeapges in 1.18 or later //@odinuge opened PR aginst it, in fact it was the part of original hugepages kep, you can see it here.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bg-chun thanks for the information, regarding huge pages reservation, we will have the same problem that we have with memory, how can you know huge pages from what NUMA node the system will use?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For hugepages, my idea is adding a new flag to reserve hugepages per numa node.
I think we can extend this approach(kubernetes/kubernetes#83541) later.

keps/sig-node/20190806-memory-manager.md Outdated Show resolved Hide resolved
keps/sig-node/20190806-memory-manager.md Outdated Show resolved Hide resolved
@bg-chun
Copy link
Member Author

bg-chun commented Aug 23, 2019

I will update this KEP next week.
=> delayed

@andyzheung
Copy link

Hello, I am really concern about memory NUMA awareness feature and have seen your proposal.
I don't see that numa relate about kube-scheduler should change, this issue is not care about it?
Meanwhile, I have see your proposal to add a new module in kubelet CM,memory manager. Did this have any PR, in your proposal, only can see some design or interface. Is this proposal just under concept?
@bg-chun

@bg-chun
Copy link
Member Author

bg-chun commented Sep 19, 2019

@andyzheung
This KEP proposes only node level changes.
I think kube-scheduler and pod/container spec should be improved to support NUMA in the future.
But, I would say that supporting NUMA at node level is first then we can talk about it.
That is the reason why this KEP covers only the node level.

In fact, my team already finish PoC of this feature based on pre-released versions(master, beta and rc1, 2) of 1.16-rel with Topology Manager.

But we face some issues that 1) pod isolation of huge pages, 2) CRI doesn't support huge pages and 3) cAdvisor doesn't discover pre-allocated huge pages per NUMA node.

So, I am trying to resolve the above issues to update huge pages KEP.
See below PR. It also contains code level PRs.
#1199
#1245

@bg-chun bg-chun changed the title Propose Memory Manager for NUMA awareness [WIP]Propose Memory Manager for NUMA awareness Oct 16, 2019
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 16, 2019
Copy link
Member

@kikisdeliveryservice kikisdeliveryservice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be missing the kep.yaml file. See : https://github.com/kubernetes/enhancements/tree/master/keps/NNNN-kep-template for template.

@cezaryzukowski
Copy link

This seems to be missing the kep.yaml file. See : https://github.com/kubernetes/enhancements/tree/master/keps/NNNN-kep-template for template.

@kikisdeliveryservice do you mean either:

  1. that we must copy our current content (.md file) to a separate .yaml file, and commit this .yaml file
  2. our current KEP does not comply with the newest template, so we must update a structure
    ?

@kikisdeliveryservice
Copy link
Member

kikisdeliveryservice commented Sep 15, 2020

This seems to be missing the kep.yaml file. See : https://github.com/kubernetes/enhancements/tree/master/keps/NNNN-kep-template for template.

@kikisdeliveryservice do you mean either:

1. that we must copy our current content (`.md` file) to a separate `.yaml` file, and commit this `.yaml` file

2. our current KEP does not comply with the newest template, so we must update a structure
   ?

Hi good question!

It's number 2. The body of your existing KEP should be in a README.md : https://github.com/kubernetes/enhancements/blob/master/keps/NNNN-kep-template/README.md

The top portion (with authors, approvers, milestones, etc..) should be in kep.yaml see:
https://github.com/kubernetes/enhancements/blob/master/keps/NNNN-kep-template/kep.yaml

The both md and yaml files should be in a folder that would live in https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/ titled something like 1769-memory-manager-numa-awareness (your choice but just begin with 1769 it's the KEP issue number).

See an example of what you are going for here: https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1967-size-memory-backed-volumes

Hope that helps!
Kirsten

@cezaryzukowski cezaryzukowski force-pushed the memory-manager-proposal branch 2 times, most recently from 89c46ab to 6aeaa5d Compare September 17, 2020 10:42
Copy link
Member

@kikisdeliveryservice kikisdeliveryservice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the update @cezaryzukowski looks good!! One small comment

- "@cynepco3hahue"
owning-sig: sig-node
participating-sigs:
status: implemented

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be implementable. A KEP status will only be implemented once the feature has graduated to GA. 😄

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many thanks, updated 😄

@cezaryzukowski cezaryzukowski force-pushed the memory-manager-proposal branch 2 times, most recently from 313880d to 9e03cde Compare September 18, 2020 10:43

For instance, consider a two-socket machine with only one size of hugepages pre-allocated. In this case, *Memory Manager* instatiates four memory tables (`[ two NUMA nodes ] * [ two memory types ]`) on start-up.

Notice that on modern architectures, a single socket, i.e., a physical NUMA node, can be further split into a number of logical NUMA nodes, which is done to further improve the performance. For this purpose, Intel offers machines with Sub-NUMA Clustering (SNC, an upgraded version of Cluster-On-Die) feature, and AMD offers mchines with NUMA Nodes per Socket (NPS) feature. In such a scenario where SNC or NPS is enabled on a machine, a K8S node and the *Memory Manager* can detect and be aware of many NUMA nodes, although the machine features a single socket (a single physical NUMA node).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no such thing as "physical NUMA node". "NUMA node" is an abstraction term describing to the OS (e.g. Linux) the way how memory controller is operating. There is no strict connection on how many are per socket. It can be one per socket, it can be multiple.

@kad
Copy link
Member

kad commented Sep 22, 2020

One of the biggest problems with this proposal, beside being limited only to Guaranteed pods, is the design that can't accommodate support for "CPU-less" NUMA nodes in Linux terms.
In cases, where NUMA node known to Linux kernel has only memory (normal of movable), it should be treated in the strict relation with CPU allocations. As result, it will be memory regions of system memory that will not be possible to consume if MM will be enabled on the system.

@cbf123
Copy link

cbf123 commented Sep 22, 2020

@kad Are there actual systems being built (that would be used for kubernetes) which have NUMA nodes with memory and no CPUs? I was under the impression that most modern systems had memory controllers embedded with the CPUs for speed.

Copy link
Member

@derekwaynecarr derekwaynecarr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a great start to improve memory alignment.

I would like to clarify in the KEP any recommendation we would make around kernel memory accounting and its relationship to this feature and any guarantees it provides as operators that consume this feature may find that behavior surprising.

If you can add a section for metrics gathering under the beta sections, it would be extremely helpful to understand as a cluster administrator if the feature is working as expected.

The *Memory Manager* is a new component of _kubelet_ ecosystem proposed to enable single-NUMA and multi-NUMA guaranteed memory allocation. Besides:
- *Memory Manager* is also a new hint provider for _Topology Manager_;
- The hints are intended to indicate preferred memory (and hugepages) affinity which pins the memory for a container either to a single or a group of NUMA nodes;
- By design, *Memory Manager* offers guaranteed memory allocation for pods in Guaranteed QoS class.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to measure the impact of this feature with kernel memory accounting, or measure the impact of numa awareness minimizing the impacts we have seen with kernel memory accounting of memory. We do not offer great community guidance on if kubelet should mandate cgroup.memory=nokmem.

see here:
https://bugzilla.redhat.com/show_bug.cgi?id=1855067#c117

"We have measured that the kernel scheduler moves containers between CPU threads. Per each CPU thread use, approximately 1/2 MiB may be allocated for kernel slabs. If you do not account for this N * 1/2 MiB overhead, where N is the number of CPU threads on the node that the pod is bound against, your container may eventually be killed by the system due to out of memory errors."

related:
opencontainers/runc#1725

section 2.7:
https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt

If someone can weigh in with any behavior difference with cgroup v2, it would be helpful as we look to improve guarantees around memory as these issues cause user surprise when we say "offer guaranteed memory allocation" across node sizes.

/cc @giuseppe

Copy link

@cezaryzukowski cezaryzukowski Sep 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the most recent issue (reported in 2020) that you mentioned above, cgroup.memory=nokmem should be used as Andrea suggested in order to disable "kernel memory accounting":

The above was measured before the OOM kill, the dump was taken after, and they
differ by 56kbytes which means all memory was tracked correctly and the bulk of
it is in 32kbyte slabs (because the kernel wasn't booted with slub_max_order=0)
for each CPU (and the only way to disable the per-memcg slab is to disable kmem
by booting the kernel with "cgroup.memory=nokmem").

But the issue related to "kernel memory accounting" popped up before (reported in 2018) and was extensively discussed [here].

In the discussion, this recommendation was provided by @dashpole to describe how to configure K8S when "kernel memory accounting" is disabled:

  1. Follow this comment to disable kernel memory accounting by recompiling the kernel: Enabling kmem accounting can break applications on CentOS7 opencontainers/runc#1725 (comment).
  2. I did some testing with kernel memory accounting disabled, and found that it made a relatively small impact on the ability of the kubelet to manage memory. I would recommend increasing --eviction-hard's memory.available parameter by 50Mi when disabling kernel memory accounting.

the way the kubelet manages memory is roughly to compare memory.usage_in_bytes to memory_capacity - eviction_threshold. Because memory.usage_in_bytes will not include kernel memory, the measured usage_in_bytes will be lower than the actual, and we may OOM even when usage_in_bytes is not close to 0. In my testing, I found that there was 30-50 Mi of remaining kernel memory when the node is under memory pressure. One way to compensate for this is to increase the eviction threshold.

@derekwaynecarr Would you expect us to provide similar recommendation for multi-NUMA use case? (as you already mentioned in another comment: "the multi-numa scenario increases the impact of cgroup.kmem behavior when sizing these workloads.")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see it some serious cons to set cgroup.memory=nokmem, from the above bug:

  • a malicious container can take over a ton of kernel memory, much more than the memory.max limit, potentially causing a DoS to other containers or system apps

  • a malicious container can fragment forever the whole physical address space rendering all page blocks non-movable, so all other containers will forever run slow even after the malicious container has been destroyed

  • for non-malicious containers, there's still a QoS concern, where one container doing lots of files I/O with zillions of tiny files could reduce the cache available to other containers

Also important to notice that kernel 5.9 has changed from having per memory cgroup kmemcahe to use a common kmemcache for all non-root memory cgroups. As a result, we should be seeing a 50% or more reduction in kernel slab memory usage in the containers.

So for the near future, I think it will be enough to add a section under container resources regarding the guaranteed pods and kmem consumption.

We can add such notice under our KEP, but IMHO it should be placed under the k8s documentation. @derekwaynecarr WDT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cynepco3hahue I agree with above analysis, and a note about the more recent kernels is sufficient.

![alt text](./mm_tm_diagram.png)

Once *kubelet* requests a guaranteed QoS pod admission, as shown in the figure above, *Topology Manager* queries *Memory Manager* about the preferred NUMA affinity for memory and hugepages, for all containers in a pod. For each container in a pod, *Memory Manager* calculates the affinity using its internal database, namely the `Node Map`. The `Node Map` is an object which is responsible for tracking the utilization of memory (and hugepages) for all containers in Guaranteed QoS class.
Once *Memory Manager* completes the computations, it returns the hints to *Topology Manager* so that *Topology Manager* can figure out which NUMA node or a group of NUMA nodes are the best fit for memory pinning, for a container. The overall calculation is performed for all containers in the pod, and if none of containers is rejected, the pod becomes finally admitted and deployed.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we clarify that the behavior for per container or per pod is based on the specified topology-manager-scope? so the behavior will vary if pod or container is provided for determining alignment?

see:
https://github.com/kubernetes/enhancements/pull/1752/files#diff-0efa13e51e66966db8daf868df3fa1afR249

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right. The behavior of hint generation will be based on the Topology Manager Scope. We will update the KEP to reflect the scope.

In detail:

  • for container scope GetTopologyHints will be called to get hints for a particular container
  • for pod scope GetPodTopologyHints will be called to get hints for entire pod

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to:

Depending on Topology Manager scope (`container` or `pod`), distinct hint generation routines are used:  
- `GetTopologyHints` for `container` scope
- `GetPodTopologyHints` for `pod` scope

Once *kubelet* requests a guaranteed QoS pod admission, as shown in the figure above, *Topology Manager* queries *Memory Manager* about the preferred NUMA affinity for memory and hugepages, for all containers in a pod. For each container in a pod, *Memory Manager* calculates the affinity using its internal database, namely the `Node Map`. The `Node Map` is an object which is responsible for tracking the utilization of memory (and hugepages) for all containers in Guaranteed QoS class.
Once *Memory Manager* completes the computations, it returns the hints to *Topology Manager* so that *Topology Manager* can figure out which NUMA node or a group of NUMA nodes are the best fit for memory pinning, for a container. The overall calculation is performed for all containers in the pod, and if none of containers is rejected, the pod becomes finally admitted and deployed.

In the admission phase, the *Memory Manager* uses `Allocate()` and updates its `Node Map` to reflect the amount of memory and hugepages requested by a container. After that, *Memory Manager* uses `AddContainer()` and enforces the consumption of memory and hugepages for the container. The enforcement limits the container's memory consumption to the NUMA node or the group of NUMA nodes indicated by *Topology Manager*. In the final step in figure above, *Memory Manager* updates `cgroups`, namely `cpuset.mems`. The *Memory Manager* will offer guaranteed memory allocation (only) for pods in Guaranteed QoS class, which is clarified in [this](#mechanism-ii-out-of-memory-oom-killer-by-kernelos) section.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am just noting here that this should work with present mapping for either cgroup v1 or v2 as my own personal reminder since we map either.

https://github.com/kubernetes/enhancements/pull/1370/files#diff-b8b7ada28664d1c52714060eb5a99abeR187


- Systems such as real-time trading system or 5G CNFs (User Plain Function, UPF) employ DPDK to guarantee low-latency packet processing. DPDK (*Data Plane Development Kit*) consists of a set of libraries designed for packet processing acceleration managed in software, in the userspace. DPDK requires dedicated resources (such as exclusive CPUs, hugepages, and DPDK-compatible Network Interface Card), and most importantly, the alignment of resources to the same NUMA node so as to prevent an unexpected performance degradation due to inter-node (between NUMA nodes) communication overhead. This creates the reason to guarantee the reservation of memory (and hugepages) and other computing resources (e.g. CPU cores and NICs) from the same NUMA node, for DPDK-oriented containers. One of key KPIs in 5G deployment is Ultra-Reliable and Low Latency Communications (URLLC) which is mission-critical. This accounts for why this proposal is also critical.

#### Story 2 : Databases
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be good if the KubeVirt community could offer their insight as well for their use case of running VM inside a pod.

/cc @rmohr and @fabiand could you provide your use cases?

Copy link

@rmohr rmohr Sep 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For KubeVirt, similar use-cases like Databases exist. The VMs which people bring to the cluster require special tuning for various reasons:

  • The workload inside the VM requires hugepages, numa placing, cpu assignment, ... In principle similar requirements like for e.g. Databases exist. We need to influence for instance the NUMA placement, so that we can pass it proplerly through to the guest, so that the workload inside the guest can be configured the right way.
  • The VMs themselves profit a lot from fine-tuning:
    • assigning full CPU cores to vCPUs inside the VM to improve performance
    • using hugepages as memory backing mechanism to avoid double-memory-management via the guest and the host kernel
    • I/O devices passthrough, where it makes sense to run on NUMA nodes nearby the device
  • Ther exist CNFs which require their own kernel modules, which can conflict with the host, VMs can then be used to overcome this. To configure the CNFs properly, we need to influence the NUMA placement for our VMs.

@derekwaynecarr I hope that highlights our need for all this.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Story 2 : Databases was extended by in-memory databases (require to sustain/manage memory across multiple NUMA nodes)
  2. Story 3 : KubeVirt (provided by @rmohr) was added


## Design Details

### How to enable the guaranteed memory allocation over many NUMA nodes?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which user story maps to this use case?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the multi-numa scenario increases the impact of cgroup.kmem behavior when sizing these workloads.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which user story maps to this use case?

Story 2 : Databases was extended by in-memory databases (require to sustain/manage memory across multiple NUMA nodes)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please correct me, but without the memory manager containers memory pinned to all available NUMA nodes, so currently we already have a worst possible case, introducing multi-numa allocation under the memory manager should not affect it and it was definitely requested by the community.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point.

- All flags are supported, i.e.: `--memory-manager-policy`, `--reserved-memory`.

#### Phase 2: Beta (target v1.21)
- Add E2E tests.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this should move to alpha, otherwise, how do we know it works at all?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. Now:

  • Add E2E tests. for Alpha
  • Extend E2E test coverage for Beta

- Alpha-level documentation.
- All flags are supported, i.e.: `--memory-manager-policy`, `--reserved-memory`.

#### Phase 2: Beta (target v1.21)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think improved instrumentation for this is important.

please add metrics exposed by kubelet to understand state of node memory map.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The topic of metrics is already addressed in this comment.

of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled). Yes, it uses a feature gate.

* **Does enabling the feature change any default behavior?**
Yes, the admission flow changes for a pod in Guaranteed QoS class. With the Memory Manager, Topology Manager also takes into account available memory (and hugepages) to either admit a pod to the node or reject it.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: available memory should be changed to allocatable memory.

the feature is not utilization aware.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Corrected.

_This section must be completed when targeting beta graduation to a release._

* **How can a rollout fail? Can it impact already running workloads?**
Try to be as paranoid as possible - e.g., what if some components will restart
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is possible during cluster lifecycle actions (upgrade, cordon, drain, admit) that the the cluster may not be able to satisfy all pods previously running depending on the order that pods are scheduled and admitted to nodes. the feature assumes operators account for this by having workloads fit to the cluster, or sufficient excess capacity to handle, or have some other defragmentation mechanism to try to make an optimal fit.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right, the order can change, and pods can be also rescheduled to different hardware.


_This section must be completed when targeting beta graduation to a release._

* **How can an operator determine if the feature is in use by workloads?**
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i would really like the kubelet to expose metrics about the state of the topology assignment.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The topic of metrics is already addressed in this comment.

@kad
Copy link
Member

kad commented Sep 22, 2020

@kad Are there actual systems being built (that would be used for kubernetes) which have NUMA nodes with memory and no CPUs? I was under the impression that most modern systems had memory controllers embedded with the CPUs for speed.

Yes. It is not about memory controllers embedded or not. It is more about how the memory is organised. "NUMA node" is just a number that identifies the region that memory controller considers as unit of grouping.

As for systems, there are on the marked any recent Xeon (Skylake and up) that can support Optane DC memory.

Once it is used, it will look like something like this (:

$ grep "" /sys/devices/system/node/has*
/sys/devices/system/node/has_cpu:0-3
/sys/devices/system/node/has_memory:0-7
/sys/devices/system/node/has_normal_memory:0-7

Similarly, it applies to several years old Xeon Phi which had "CPU-less NUMA nodes" for MCDRAM (high bandwidth memory, beneficial for HPC apps).

So, in short, the Linux kernel have for the reason those entries in /sys/devices/system/node/*, to make sure that CPU cores/memory controllers mappings can be identified.

@cezaryzukowski cezaryzukowski force-pushed the memory-manager-proposal branch 4 times, most recently from 824cbe0 to d444ba5 Compare September 29, 2020 13:39
Signed-off-by: Cezary Zukowski <[email protected]>
Copy link
Member

@derekwaynecarr derekwaynecarr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I appreciate the depth of the proposal.
This is a complicated topic, and I look forward to seeing what we learn as it iterates to alpha and beyond.

/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 2, 2020
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bg-chun, derekwaynecarr

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 2, 2020
@k8s-ci-robot k8s-ci-robot merged commit 39ab412 into kubernetes:master Oct 2, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.20 milestone Oct 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/node Categorizes an issue or PR as relevant to SIG Node. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.