
KEP-2400: Graduate swap to Beta 1 #3957

Merged: 1 commit, Jun 15, 2023
91 changes: 78 additions & 13 deletions keps/sig-node/2400-node-swap/README.md
@@ -8,6 +8,10 @@
- [Goals](#goals)
- [Non-Goals](#non-goals)
- [Proposal](#proposal)
- [Enable Swap Support only for Burstable QoS Pods](#enable-swap-support-only-for-burstable-qos-pods)
- [Set Aside Swap for System Critical Daemon](#set-aside-swap-for-system-critical-daemon)
- [Steps to Calculate Swap Limit](#steps-to-calculate-swap-limit)
- [Example](#example)
- [User Stories](#user-stories)
- [Improved Node Stability](#improved-node-stability)
- [Long-running applications that swap out startup memory](#long-running-applications-that-swap-out-startup-memory)
@@ -17,6 +21,7 @@
- [Virtualization management overhead](#virtualization-management-overhead)
- [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
- [Risks and Mitigations](#risks-and-mitigations)
- [Security risk](#security-risk)
- [Design Details](#design-details)
- [Enabling swap as an end user](#enabling-swap-as-an-end-user)
- [API Changes](#api-changes)
@@ -30,7 +35,8 @@
- [Graduation Criteria](#graduation-criteria)
- [Alpha](#alpha)
- [Alpha2](#alpha2)
- [Beta](#beta)
- [Beta 1](#beta-1)
- [Beta 2](#beta-2)
- [GA](#ga)
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
- [Version Skew Strategy](#version-skew-strategy)
@@ -166,6 +172,54 @@ administrators can configure the kubelet such that:

This proposal enables scenarios 1 and 2 above, but not 3.

### Enable Swap Support only for Burstable QoS Pods
Before enabling swap support through the pod API, it is crucial to build confidence in this feature by carefully assessing its impact on workloads and Kubernetes. As an initial step, we propose enabling swap support for Burstable QoS Pods by automatically calculating the appropriate swap values, rather than allowing users to input these values manually.

Swap access is granted only to pods of Burstable QoS. Guaranteed QoS pods are usually higher-priority pods, so we want to spare them swap's performance penalty. Best-Effort pods, by contrast, are low-priority pods that are the first to be killed under node pressure. In addition, they're unpredictable, so it's hard to assess how much swap memory is reasonable to allocate for them.

By doing so, we can ensure a thorough understanding of the feature's performance and stability before considering the manual input of swap values in a subsequent beta release. This cautious approach will ensure the efficient allocation of resources and the smooth integration of swap support into Kubernetes.

The swap limit for each container is derived from its requested memory, scaled by the proportion of the total swap memory available (see the steps below).
> **Member:** What if the container hasn't requested anything? It also might or might not have a limit set up. Let's define the behavior in this case.

> **Member:** Is the proposal to start with the proportional approach to avoid the API change?

> **Contributor:** Yes indeed. Or more generally: to avoid letting the user configure swap, which is very difficult to do and might compromise the node.


#### Set Aside Swap for System Critical Daemon

System critical daemons (such as the kubelet) are essential for node health. Usually, an appropriate portion of system resources (e.g., memory, CPU) is reserved as system reserved. However, swap doesn't inherently support reserving a portion out of the total available. For instance, in the case of memory, we set `memory.min` on the node-level cgroup to ensure an adequate amount of memory is set aside for system critical daemons, away from the pods. But there is no equivalent for swap; i.e., no `memory.swap.min` is supported in the kernel.

Since this proposal advocates enabling swap only for the Burstable QoS pods, this can be done by setting `memory.swap.max` on the cgroups used by the Burstable QoS pods. The value of this `memory.swap.max` can be calculated by:

```
memory.swap.max = total swap memory available on the system - system reserved memory
```

This is the total amount of swap available for all the Burstable QoS pods; let's call it `TotalPodsSwapAvailable`. It ensures that system critical daemons have access to at least as much swap as the system reserved memory, which indirectly acts as swap support in system reserved.
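To make the arithmetic concrete, here is a minimal sketch in Go of the node-level budget and cgroup write described above. The cgroup path and all helper names are illustrative assumptions, not actual kubelet code.

```go
// Sketch of the node-level swap budget described above. The cgroup path and
// helper names are illustrative assumptions, not actual kubelet code.
package main

import (
	"fmt"
	"os"
	"strconv"
)

// totalPodsSwapAvailable is the swap budget shared by all Burstable QoS pods:
// total system swap minus the system reserved memory.
func totalPodsSwapAvailable(totalSwapBytes, systemReservedBytes int64) int64 {
	if totalSwapBytes <= systemReservedBytes {
		return 0 // nothing left for pods once the reserve is set aside
	}
	return totalSwapBytes - systemReservedBytes
}

func main() {
	const gib = int64(1 << 30)
	budget := totalPodsSwapAvailable(40*gib, 2*gib) // 38 GiB, as in the example below

	// On cgroup v2, capping swap for all Burstable pods means writing the
	// budget to memory.swap.max on the Burstable QoS parent cgroup. The path
	// below assumes the conventional systemd cgroup layout.
	path := "/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/memory.swap.max"
	if err := os.WriteFile(path, []byte(strconv.FormatInt(budget, 10)), 0o644); err != nil {
		fmt.Fprintln(os.Stderr, "writing memory.swap.max:", err)
	}
}
```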

### Steps to Calculate Swap Limit

1. **Calculate the container's memory proportionate to the node's memory:**
   - Divide the container's memory request by the node's total physical memory. Let's call this value `ContainerMemoryProportion`.
- If a container is defined with memory requests == memory limits, its `ContainerMemoryProportion` is defined as 0. Therefore, as can be seen below, its overall swap limit is also 0.

2. **Multiply the container memory proportion by the available swap memory for Pods:**
- Meaning: `ContainerMemoryProportion * TotalPodsSwapAvailable`.

#### Example
Suppose we have a Burstable QoS pod with two containers:

- Container A: Memory request 20 GB
- Container B: Memory request 10 GB

Let's assume the node's total physical memory is 40 GB and the total swap memory available is also 40 GB, and that the system reserved memory is configured at 2 GB. Then `TotalPodsSwapAvailable` = 40 GB - 2 GB = 38 GB.

Step 1: Determine each container's memory proportion:
- Container A: `20G/40G` = `0.5`.
- Container B: `10G/40G` = `0.25`.

Step 2: Determine the swap limit for each container:
- Container A: `ContainerMemoryProportion * TotalPodsSwapAvailable` = `0.5 * 38G` = `19G`.
- Container B: `ContainerMemoryProportion * TotalPodsSwapAvailable` = `0.25 * 38G` = `9.5G`.

In this example, Container A would have a swap limit of 19 GB, and Container B would have a swap limit of 9.5 GB.

This approach allocates swap limits based on each container's memory request and adjusts the proportion based on the total swap memory available in the system. It ensures that each container gets a fair share of the swap space and helps maintain resource allocation efficiency.
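The same calculation can be expressed as a short Go sketch that reproduces the worked example above. The function and variable names are illustrative, not kubelet internals, and the memory limits passed in `main` are arbitrary values above the requests, assumed only for the sake of the example.

```go
// Sketch of the per-container swap limit calculation from the steps above.
// Names are illustrative, not kubelet internals.
package main

import "fmt"

// containerSwapLimit computes:
//
//	limit = (memoryRequest / nodeMemory) * totalPodsSwapAvailable
//
// A container with request == limit has a proportion of 0, hence no swap.
func containerSwapLimit(memoryRequest, memoryLimit, nodeMemory, totalPodsSwapAvailable int64) int64 {
	if nodeMemory == 0 || memoryRequest == memoryLimit {
		return 0
	}
	proportion := float64(memoryRequest) / float64(nodeMemory)
	return int64(proportion * float64(totalPodsSwapAvailable))
}

func main() {
	const gib = int64(1 << 30)
	nodeMemory := 40 * gib
	podsSwap := 38 * gib // 40 GiB total swap - 2 GiB system reserve

	a := containerSwapLimit(20*gib, 30*gib, nodeMemory, podsSwap)
	b := containerSwapLimit(10*gib, 15*gib, nodeMemory, podsSwap)
	fmt.Printf("Container A: %d bytes (19 GiB)\n", a)
	fmt.Printf("Container B: %d bytes (9.5 GiB)\n", b)
}
```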

### User Stories

#### Improved Node Stability
Expand Down Expand Up @@ -300,6 +354,14 @@ and/or workloads in a number of different scenarios.
Since swap provisioning is out of scope of this proposal, this enhancement
poses low risk to Kubernetes clusters that will not enable swap.

#### Security risk

Enabling swap on a system without encryption poses a security risk, as critical information, such as Kubernetes secrets, may be swapped out to disk. If an unauthorized individual gains access to the disk, they could potentially obtain these secrets. To mitigate this risk, it is recommended to use encrypted swap. However, handling encrypted swap is not within the scope of the kubelet; rather, it is a general OS configuration concern and should be addressed at that level. Nevertheless, it is essential to provide documentation that warns users of this issue, ensuring they are aware of the security implications and can take appropriate steps to safeguard their systems.

To guarantee that system daemons are not swapped, the kubelet must set `memory.swap.max` to `0` within the system reserved cgroup. Moreover, to make sure that Burstable pods are able to utilize swap space, the kubelet should verify that the cgroup associated with Burstable pods is not nested under the cgroup designated for system reserved.
> **Contributor:** In other words: `kubeletconfig.CgroupRoot` can't be a child of `kubeletconfig.SystemReservedCgroups`.
>
> As an implementation note: this should probably be added to the kubelet config documentation.
Additionally, an end user may decide to disable swap completely for a Pod or a container in Beta 1 by making the Pod Guaranteed, or by setting request == limit for a container. This way, no swap is enabled for the corresponding containers and there is no information exposure risk.
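The two kubelet-side safeguards above can be illustrated with a brief Go sketch. The cgroup paths and function names are assumptions for illustration on a cgroup v2 host, not the actual kubelet implementation.

```go
// Sketch of the safeguards above, assuming cgroup v2. Paths and names are
// illustrative, not the actual kubelet implementation.
package main

import (
	"fmt"
	"os"
	"strings"
)

// disableSwap writes 0 to memory.swap.max so that processes in the given
// cgroup (e.g. system daemons under the system reserved cgroup) never swap.
func disableSwap(cgroupPath string) error {
	return os.WriteFile(cgroupPath+"/memory.swap.max", []byte("0"), 0o644)
}

// validateNotNested rejects configurations where the pods' cgroup root is
// nested under the system reserved cgroup, which would re-apply the
// swap-off setting to Burstable pods.
func validateNotNested(cgroupRoot, systemReservedCgroup string) error {
	if cgroupRoot == systemReservedCgroup ||
		strings.HasPrefix(cgroupRoot, systemReservedCgroup+"/") {
		return fmt.Errorf("cgroup root %q must not be nested under system reserved cgroup %q",
			cgroupRoot, systemReservedCgroup)
	}
	return nil
}

func main() {
	if err := disableSwap("/sys/fs/cgroup/system.slice"); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
	if err := validateNotNested("/sys/fs/cgroup/kubepods.slice", "/sys/fs/cgroup/system.slice"); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```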

## Design Details

We summarize the implementation plan as follows:
@@ -487,14 +549,14 @@ Test grid tabs enabled:

No new e2e tests introduced.

-For alpha2 [Current stage]:
For alpha2:

- Add e2e tests that exercise all available swap configurations via the CRI.
- Verify MemoryPressure behavior with swap enabled and document any changes
for configuring eviction.
- Verify new system-reserved settings for swap memory.

-For beta [Future]:
For beta 1:

- Add e2e tests that verify pod-level control of swap utilization.
- Add e2e tests that verify swap performance with pods using a tmpfs.
> **Contributor** (on lines +559 to 562): I would be happy for some clarifications regarding the test plan for Beta 1.
>
> > Add e2e tests that verify pod-level control of swap utilization.
>
> This sounds a bit outdated. Can we change it to something like "Add e2e tests that verify that the right amount of swap limitation is automatically configured for burstable pods"?
>
> > Add e2e tests that verify swap performance with pods using a tmpfs.
>
> @SergeyKanzhelev would you mind clarifying that? Which performance test do we want to do here? IOW, what do we want to verify? Also, what does "pods using a tmpfs" mean in this context?
>
> Thanks!

> **Contributor Author:** @SergeyKanzhelev sorry, but I was also not able to gather any context around the line "Add e2e tests that verify swap performance with pods using a tmpfs." I see that it was added when swap was introduced as alpha.

@@ -536,21 +598,22 @@ Here are specific improvements to be made:
swap limit for workloads.
- Investigate eviction behavior with swap enabled.


-#### Beta
-
-- Add support for controlling swap consumption at the pod level [via cgroups].
-- Handle usage of swap during container restart boundaries for writes to tmpfs
-  (which may require pod cgroup change beyond what container runtime will do at
-  container cgroup boundary).

#### Beta 1
> **Member:** Let's limit swap to cgroup v2 only.

> **Contributor:** Added 👍

> **Contributor** (@iholder101, Jun 6, 2023): Not that I have a strong push-back against this, but I would still like to know: why not support both versions? I understand that v2 is better in many ways, but should we really forcefully disallow swap usage on v1? Isn't a warning or discouraging it enough?

> **Member:** The reason to disallow is to make sure there is a way for end users to protect themselves from swap on specific Pods. On v1, whatever we do, the app can still get swapped, so whatever end users do with their Pod, there is no guarantee it won't be swapped. On v2 we can provide some guarantees. Keeping in mind that v2 is something we want to encourage anyway, it's better to keep things safer out of the box.

> **Member:** It seems that I need to update kubernetes/kubernetes#105271 to support cgroup v2 only. For cgroup v1, I think we should just log a warning and skip the logic.
- Enable swap support for Burstable QoS Pods only.
- Enable swap support for cgroup v2 only.
- Add swap memory to the kubelet stats API.
- Determine a set of metrics for node QoS in order to evaluate the performance
  of nodes with and without swap enabled.
- Better understand relationship of swap with memory QoS in cgroup v2
  (particularly `memory.high` usage).
> **Member:** For cgroup v2, memory.swap.high may be helpful at the node level. We can set it at the node level using a factor, like the other KEP https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2570-memory-qos does for memory.

> **Contributor:** Yeah, makes sense to align with v2 QoS work.

> **Contributor:** I don't think that using swap throttling is the right approach here.
>
> There are significant differences between `memory.high` and `memory.swap.high`, and they are used in completely different ways.
>
> `memory.high` is described as "the main mechanism to control memory usage", which, when hit, "throttles allocations by forcing them into direct reclaim to work off the excess" [1].
>
> However, `memory.swap.high` is described as [1]:
>
> > Swap usage throttle limit. If a cgroup's swap usage exceeds
> > this limit, all its further allocations will be throttled to
> > allow userspace to implement custom out-of-memory procedures.
> >
> > This limit marks a point of no return for the cgroup. It is NOT
> > designed to manage the amount of swapping a workload does
> > during regular operation. Compare to memory.swap.max, which
> > prohibits swapping past a set amount, but lets the cgroup
> > continue unimpeded as long as other memory can be reclaimed.
> >
> > Healthy workloads are not expected to reach this limit.
>
> In addition, the Linux kernel commit message that introduced `memory.swap.high` says [2] that "that is not to say that the knob itself is equivalent to memory.high" and that "slowing misbehaving tasks down gradually allows user space oom killers or other protection mechanisms to react".
>
> Since we don't implement any user-space OOM killers, IIUC reaching the `memory.swap.high` limit would freeze the processes in the cgroup forever.
>
> This makes sense, as swap memory isn't reclaimed as often as RAM, and it's much more expensive to do so. If processes in a cgroup reach the `memory.swap.max` boundary, however, they have to keep allocated memory in RAM; if `memory.high` is defined, as the KEP suggests, the processes then reach a point where they freeze up, forced to reclaim memory to continue their operation.
>
> The bottom line is that processes that are out of memory should reclaim RAM, not swap memory.
>
> [1] https://www.kernel.org/doc/Documentation/admin-guide/cgroup-v2.rst
> [2] https://lore.kernel.org/linux-mm/20200602044952.g4tpmtOiL%[email protected]/

> **Contributor Author:**
>
> > Slowing misbehaving tasks down gradually allows user space oom killers or other protection mechanisms to react
>
> IMO this is a good thing. We have seen so many issues in the past where the workload ends up consuming memory so fast that the kernel doesn't get adequate time to react. In that scenario, the entire system just freezes and the node eventually transitions into the NOT_READY state in k8s.

> **Contributor:** The two lines I'm most concerned about, and don't fully understand, are:
>
> > marks a point of no return for the cgroup.
>
> What does "point of no return" actually mean here? Does it mean that a user-space procedure needs to resume the cgroup's execution, or else it won't ever be able to allocate memory?
>
> > It is NOT designed to manage the amount of swapping a workload does during regular operation.
>
> What are "regular operations"?
>
> It worries me that this description is very different from the description of `memory.high`, which is said to be "the main mechanism to control memory usage".
- Collect feedback from test use cases.
- Make sure node e2e jobs that use swap are healthy.
- Improve coverage for appropriate scenarios in testgrid.

#### Beta 2
- Publish a Kubernetes doc page encouraging users to use encrypted swap if they wish to enable this feature.
- Handle usage of swap during container restart boundaries for writes to tmpfs
(which may require pod cgroup change beyond what container runtime will do at
container cgroup boundary).


[via cgroups]: #restrict-swap-usage-at-the-cgroup-level

#### GA
@@ -559,6 +622,8 @@ _(Tentative.)_

- Test a wide variety of scenarios that may be affected by swap support.
- Remove feature flag.
- Remove the restriction of swap support to Burstable QoS Pods only, deprecated in Beta 2.


### Upgrade / Downgrade Strategy

6 changes: 4 additions & 2 deletions keps/sig-node/2400-node-swap/kep.yaml
@@ -4,6 +4,8 @@ authors:
- "@ehashman"
- "@ike-ma"
- "@SergeyKanzhelev"
- "@harche"
- "@iholder101"
owning-sig: sig-node
participating-sigs:
- sig-node
@@ -18,12 +20,12 @@
- "@dchen1107"

# The target maturity stage in the current dev cycle for this KEP.
-stage: alpha
stage: beta

# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on.
latest-milestone: "v1.27"
latest-milestone: "v1.28"

# The milestone at which this feature was, or is targeted to be, at each stage.
milestone: