From a1941aad0f162fd0320806b6c8faee6ec93b2a73 Mon Sep 17 00:00:00 2001
From: Elana Hashman <ehashman@redhat.com>
Date: Tue, 6 Apr 2021 15:21:48 -0700
Subject: [PATCH 1/9] Add draft for node swap KEP

---
 keps/sig-node/2400-node-swap/README.md | 632 +++++++++++++++++++++++++
 keps/sig-node/2400-node-swap/kep.yaml  |  38 ++
 2 files changed, 670 insertions(+)
 create mode 100644 keps/sig-node/2400-node-swap/README.md
 create mode 100644 keps/sig-node/2400-node-swap/kep.yaml
diff --git a/keps/sig-node/2400-node-swap/README.md b/keps/sig-node/2400-node-swap/README.md
new file mode 100644
index 00000000000..3386e3fa890
--- /dev/null
+++ b/keps/sig-node/2400-node-swap/README.md
@@ -0,0 +1,632 @@
+# KEP-2400: Node system swap support
+
+<!--
+This is the title of your KEP. Keep it short, simple, and descriptive. A good
+title can help communicate what the KEP is and should be considered as part of
+any review.
+-->
+
+<!--
+A table of contents is helpful for quickly jumping to sections of a KEP and for
+highlighting any additional information provided beyond the standard KEP
+template.
+
+Ensure the TOC is wrapped with
+  <code>&lt;!-- toc --&gt;&lt;!-- /toc --&gt;</code>
+tags, and then generate with `hack/update-toc.sh`.
+-->
+
+<!-- toc -->
+- [Release Signoff Checklist](#release-signoff-checklist)
+- [Summary](#summary)
+- [Motivation](#motivation)
+  - [Scenarios](#scenarios)
+  - [Goals](#goals)
+  - [Non-Goals](#non-goals)
+- [Proposal](#proposal)
+  - [User Stories](#user-stories)
+    - [Improved Node Stability](#improved-node-stability)
+    - [Long-running applications that swap out startup memory](#long-running-applications-that-swap-out-startup-memory)
+    - [Memory Flexibility](#memory-flexibility)
+    - [Local development and systems with fast storage](#local-development-and-systems-with-fast-storage)
+    - [Low footprint systems](#low-footprint-systems)
+    - [Virtualization management overhead](#virtualization-management-overhead)
+  - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
+  - [Risks and Mitigations](#risks-and-mitigations)
+- [Design Details](#design-details)
+  - [Test Plan](#test-plan)
+  - [Graduation Criteria](#graduation-criteria)
+    - [Alpha](#alpha)
+    - [Beta](#beta)
+    - [GA](#ga)
+  - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
+  - [Version Skew Strategy](#version-skew-strategy)
+- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
+  - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
+  - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
+  - [Monitoring Requirements](#monitoring-requirements)
+  - [Dependencies](#dependencies)
+  - [Scalability](#scalability)
+  - [Troubleshooting](#troubleshooting)
+- [Implementation History](#implementation-history)
+- [Drawbacks](#drawbacks)
+- [Alternatives](#alternatives)
+  - [Just set <code>--fail-swap-on=false</code>](#just-set-)
+- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
+<!-- /toc -->
+
+## Release Signoff Checklist
+
+<!--
+**ACTION REQUIRED:** In order to merge code into a release, there must be an
+issue in [kubernetes/enhancements] referencing this KEP and targeting a release
+milestone **before the [Enhancement Freeze](https://git.k8s.io/sig-release/releases)
+of the targeted release**.
+
+For enhancements that make changes to code or processes/procedures in core
+Kubernetes—i.e., [kubernetes/kubernetes], we require the following Release
+Signoff checklist to be completed.
+
+Check these off as they are completed for the Release Team to track. These
+checklist items _must_ be updated for the enhancement to be released.
+-->
+
+Items marked with (R) are required *prior to targeting to a milestone / release*.
+
+- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
+- [ ] (R) KEP approvers have approved the KEP status as `implementable`
+- [ ] (R) Design details are appropriately documented
+- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
+- [ ] (R) Graduation criteria is in place
+- [ ] (R) Production readiness review completed
+- [ ] (R) Production readiness review approved
+- [ ] "Implementation History" section is up-to-date for milestone
+- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
+- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
+
+<!--
+**Note:** This checklist is iterative and should be reviewed and updated every time this enhancement is being considered for a milestone.
+-->
+
+[kubernetes.io]: https://kubernetes.io/
+[kubernetes/enhancements]: https://git.k8s.io/enhancements
+[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
+[kubernetes/website]: https://git.k8s.io/website
+
+## Summary
+
+Kubernetes currently does not support the use of [swap memory](https://en.wikipedia.org/wiki/Paging#Linux) on Linux, as it is difficult to provide guarantees and account for pod memory utilization when swap is involved. As part of Kubernetes’ earlier design, [swap support was considered out of scope](https://github.com/kubernetes/kubernetes/issues/7294).
+
+However, there are a [number of use cases](#user-stories) that would benefit from Kubernetes nodes supporting swap. Hence, this proposal aims to add swap support to nodes in a controlled, predictable manner so that Kubernetes users can perform testing and provide data to continue building cluster capabilities on top of swap.
+
+## Motivation
+
+There are two distinct types of user for swap, who may overlap:
+- node administrators, who may want swap available for node-level performance tuning and stability/reducing noisy neighbour issues
+- application developers, who have written applications that would benefit from using swap memory
+
+There are hence a number of possible ways that one could envision swap use on a node.
+
+### Scenarios
+
+1. Swap is enabled only at the system level. The CRI does not permit user workloads to use swap. (This scenario is a prerequisite for the following use cases.)
+1. Swap is enabled at the node level. The CRI can be globally configured to permit user workloads scheduled on the node to use some quantity of swap.
+1. Swap is set on a per-workload basis. The CRI sets permitted swap utilization on each individual workload.
+
+This KEP will be limited in scope to the first two scenarios. The third can be addressed in a follow-up KEP. The enablement work that is in scope for this KEP will be necessary to implement the third scenario.
+
+
+### Goals
+
+- On Linux systems, when swap is provisioned and available, Kubelet can start up with swap on.
+- Configuration is available for CRI to set swap utilization available to Kubernetes workloads, defaulting to 0 swap.
+- Cluster administrators can enable and configure CRI swap utilization on a per-node basis.
+
+### Non-Goals
+
+- Provisioning swap. Swap must already be available on the system.
+- Allocating swap on a per-workload basis with accounting (e.g. pod-level specification of swap). If desired, this should be designed and implemented as part of a follow-up KEP. This KEP is a prerequisite for that work.
+- Supporting zram, zswap, or other memory types like SGX EPC. These could be addressed in a follow-up KEP, and are out of scope.
+
+## Proposal
+
+I propose that, when swap is provisioned and available on a node, we allow cluster administrators to configure the Kubelet and CRI such that:
+
+- The kubelet can start with swap on.
+- The CRI is updated such that by default, workloads will use 0 swap.
+- The CRI will have configuration available such that swap utilization can be configured for the entire node (e.g. as a percentage of pod memory requests).
+
+This proposal enables scenarios 1 and 2 above, but not 3.
+
+### User Stories
+
+#### Improved Node Stability
+
+The improved memory management algorithms in cgroups v2, such as oomd, currently require swap. Hence, having a small amount of swap available on nodes could improve resource pressure handling and recovery.
+
+- https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#id1
+- https://chrisdown.name/2018/01/02/in-defence-of-swap.html
+- https://media.ccc.de/v/ASG2018-175-oomd
+- https://github.com/facebookincubator/oomd/blob/master/docs/production_setup.md#swap
+
+This user story is addressed by scenario 1 and 2, and could benefit from 3.
+
+#### Long-running applications that swap out startup memory
+
+- Applications such as the Java and Node runtimes rely on swap for optimal performance https://github.com/kubernetes/kubernetes/issues/53533#issue-263475425
+- Initialization logic of applications can be safely swapped out without affecting long-running application resource usage https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-615967154
+
+This user story is addressed by scenario 2, and could benefit from 3.
+
+#### Memory Flexibility
+
+This user story addresses cases in which cost of additional memory is prohibitive, or elastic scaling is impossible (e.g. on-premise/bare metal deployments).
+
+- Occasional cron job with high memory usage and lack of swap support means cloud nodes must always be allocated for maximum possible memory utilization, leading to overprovisioning/high costs https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-354832960
+- Lack of swap support would require provisioning 3x the amount of memory as required with swap https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-617654228
+- On-premise deployment can’t horizontally scale available memory based on load https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-637715138
+- Scaling resources is technically feasible but cost-prohibitive, swap provides flexibility at lower cost https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-553713502
+
+This user story is addressed by scenario 2, and could benefit from 3.
+
+#### Local development and systems with fast storage
+
+Local development or single-node clusters and systems with fast storage may benefit from using available swap (e.g. NVMe swap partitions, one-node clusters).
+
+- Single node, local Kubernetes deployment on laptop https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-361748518
+- Linux has optimizations for swap on SSD, allowing for performance boosts https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-589275277
+
+This user story is addressed by scenarios 1 and 2, and could benefit from 3.
+
+#### Low footprint systems
+
+For example, edge devices with limited memory.
+
+- Edge compute systems/devices with small memory footprints (\<2Gi) https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-751398086
+- Clusters with nodes \<4Gi memory https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-751404417
+
+This user story is addressed by scenario 2, and could benefit from 3.
+
+#### Virtualization management overhead
+
+This would apply to virtualized Kubernetes workloads such as VMs launched by kubevirt.
+
+Every VM comes with a management related overhead which can sporadically be pretty significant (memory streaming, SRIOV attachment, gpu attachment, virtio-fs, …). Swap helps to not request much more memory to deal with short term worst-case scenarios.
+
+With virtualization, clusters are typically provisioned based on the workloads’ memory consumption, and any infrastructure container overhead is overcommitted. This overhead could be safely swapped out.
+
+- Required for live migration of VMs https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-754878431
+
+This user story is addressed by scenario 2, and could benefit from 3.
+
+### Notes/Constraints/Caveats (Optional)
+
+<!--
+What are the caveats to the proposal?
+What are some important details that didn't come across above?
+Go in to as much detail as necessary here.
+This might be a good place to talk about core concepts and how they relate.
+-->
+
+### Risks and Mitigations
+
+Having swap available on a system reduces predictability. When swap is available to workloads, and is not accounted for on an individual workload-by-workload basis, the kubelet’s resource accounting may become much less accurate.
+
+First, this risk is mitigated by preventing any workloads from using swap by default, even if it is enabled on a system. This will allow a cluster administrator to test swap utilization just at the system level without introducing unpredictability to workload resource utilization.
+
+Additionally, we mitigate this risk by quantifying system stability and then gathering test and production data to determine if system stability remains the same or is improved when swap is available to the system and/or workloads.
+
+Since swap provisioning is out of scope of this proposal, this enhancement poses little risk to Kubernetes clusters that will not enable swap.
+
+## Design Details
+
+\[In progress\]
+
+Need to add specifics here for:
+
+- Changes to `--fail-on-swap` flag
+- CRI config details
+- Where changes will need to be made so that dockershim and the CRI are consistent with swap control
+
+### Test Plan
+
+For alpha:
+
+- Swap scenarios are enabled in test-infra for at least two Linux distributions. e2e suites will be run against them.
+- Data should be gathered from a number of use cases to guide beta graduation and further development efforts.
+
+Once this data is available, additional test plans should be added for the next phase of graduation.
+
+### Graduation Criteria
+
+#### Alpha
+
+- Kubelet can be started with swap enabled.
+- KubeletConfig allows CRI to be configured with a percentage of swap available to workloads. This will default to 0.
+- e2e test jobs are configured for Linux systems with swap enabled.
+
+
+#### Beta
+
+(Tentative.)
+
+- Determine a set of metrics for node QoS in order to evaluate the performance of nodes with and without swap enabled.
+- Collect feedback from test use cases.
+- Improve coverage for appropriate scenarios in testgrid.
+
+#### GA
+
+- Remove feature flag.
+
+### Upgrade / Downgrade Strategy
+
+<!--
+If applicable, how will the component be upgraded and downgraded? Make sure
+this is in the test plan.
+
+Consider the following in developing an upgrade/downgrade strategy for this
+enhancement:
+- What changes (in invocations, configurations, API use, etc.) is an existing
+  cluster required to make on upgrade, in order to maintain previous behavior?
+- What changes (in invocations, configurations, API use, etc.) is an existing
+  cluster required to make on upgrade, in order to make use of the enhancement?
+-->
+
+No changes are required on upgrade to maintain previous behaviour.
+
+It is possible to downgrade a kubelet on a node that was using swap, but this would require disabling the use of swap and setting `swapoff` on the node.
+
+### Version Skew Strategy
+
+<!--
+If applicable, how will the component handle version skew with other
+components? What are the guarantees? Make sure this is in the test plan.
+
+Consider the following in developing a version skew strategy for this
+enhancement:
+- Does this enhancement involve coordinating behavior in the control plane and
+  in the kubelet? How does an n-2 kubelet without this feature available behave
+  when this feature is used?
+- Will any other components on the node change? For example, changes to CSI,
+  CRI or CNI may require updating that component before the kubelet.
+-->
+
+Feature flag will apply to kubelet only, so version skew strategy is N/A.
+
+## Production Readiness Review Questionnaire
+
+<!--
+
+Production readiness reviews are intended to ensure that features merging into
+Kubernetes are observable, scalable and supportable; can be safely operated in
+production environments, and can be disabled or rolled back in the event they
+cause increased failures in production. See more in the PRR KEP at
+https://git.k8s.io/enhancements/keps/sig-architecture/1194-prod-readiness.
+
+The production readiness review questionnaire must be completed and approved
+for the KEP to move to `implementable` status and be included in the release.
+
+In some cases, the questions below should also have answers in `kep.yaml`. This
+is to enable automation to verify the presence of the review, and to reduce review
+burden and latency.
+
+The KEP must have an approver from the
+[`prod-readiness-approvers`](http://git.k8s.io/enhancements/OWNERS_ALIASES)
+team. Please reach out on the
+[#prod-readiness](https://kubernetes.slack.com/archives/CPNHUMN74) channel if
+you need any help or guidance.
+-->
+
+### Feature Enablement and Rollback
+
+<!--
+This section must be completed when targeting alpha to a release.
+-->
+
+###### How can this feature be enabled / disabled in a live cluster?
+
+<!--
+Pick one of these and delete the rest.
+-->
+
+- [x] Feature gate (also fill in values in `kep.yaml`)
+  - Feature gate name: NodeSwapEnabled
+  - Components depending on the feature gate: Kubelet
+- [ ] Other
+  - Describe the mechanism:
+  - Will enabling / disabling the feature require downtime of the control
+    plane?
+  - Will enabling / disabling the feature require downtime or reprovisioning
+    of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled).
+
+###### Does enabling the feature change any default behavior?
+
+<!--
+Any change of default behavior may be surprising to users or break existing
+automations, so be extremely careful here.
+-->
+
+No. If the feature flag is enabled, the user must still set `--fail-swap-on=false` to adjust the default behaviour.
+
+A node must have swap provisioned and available for this feature to work. If there is no swap available, but the feature flag is set to true, there will still be no change in existing behaviour.
+
+###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
+
+<!--
+Describe the consequences on existing workloads (e.g., if this is a runtime
+feature, can it break the existing applications?).
+
+NOTE: Also set `disable-supported` to `true` or `false` in `kep.yaml`.
+-->
+
+No. I don’t think it makes much sense to be able to provide users a meaningful ability to disable the feature flag at runtime, as this would be highly disruptive to workloads and difficult to implement. To turn this off, the kubelet would need to be restarted.
+
+###### What happens if we reenable the feature if it was previously rolled back?
+
+N/A
+
+###### Are there any tests for feature enablement/disablement?
+
+<!--
+The e2e framework does not currently support enabling or disabling feature
+gates. However, unit tests in each component dealing with managing data, created
+with and without the feature, are necessary. At the very least, think about
+conversion tests if API types are being modified.
+-->
+
+N/A. This should be tested separately for scenarios with the flag enabled and disabled.
+
+### Rollout, Upgrade and Rollback Planning
+
+<!--
+This section must be completed when targeting beta to a release.
+-->
+
+###### How can a rollout fail? Can it impact already running workloads?
+
+<!--
+Try to be as paranoid as possible - e.g., what if some components will restart
+mid-rollout?
+-->
+
+###### What specific metrics should inform a rollback?
+
+<!--
+What signals should users be paying attention to when the feature is young
+that might indicate a serious problem?
+-->
+
+###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
+
+<!--
+Describe manual testing that was done and the outcomes.
+Longer term, we may want to require automated upgrade/rollback tests, but we
+are missing a bunch of machinery and tooling and can't do that now.
+-->
+
+###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
+
+<!--
+Even if applying deprecation policies, they may still surprise some users.
+-->
+
+### Monitoring Requirements
+
+<!--
+This section must be completed when targeting beta to a release.
+-->
+
+###### How can an operator determine if the feature is in use by workloads?
+
+<!--
+Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
+checking if there are objects with field X set) may be a last resort. Avoid
+logs or events for this purpose.
+-->
+
+###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
+
+<!--
+Pick one or more of these and delete the rest.
+-->
+
+- [ ] Metrics
+  - Metric name:
+  - [Optional] Aggregation method:
+  - Components exposing the metric:
+- [ ] Other (treat as last resort)
+  - Details:
+
+###### What are the reasonable SLOs (Service Level Objectives) for the above SLIs?
+
+<!--
+At a high level, this usually will be in the form of "high percentile of SLI
+per day <= X". It's impossible to provide comprehensive guidance, but at the very
+high level (needs more precise definitions) those may be things like:
+  - per-day percentage of API calls finishing with 5XX errors <= 1%
+  - 99th percentile over day of absolute value from (job creation time minus expected
+    job creation time) for cron job <= 10%
+  - 99.9% of /health requests per day finish with 200 code
+-->
+
+###### Are there any missing metrics that would be useful to have to improve observability of this feature?
+
+<!--
+Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
+implementation difficulties, etc.).
+-->
+
+### Dependencies
+
+<!--
+This section must be completed when targeting beta to a release.
+-->
+
+###### Does this feature depend on any specific services running in the cluster?
+
+<!--
+Think about both cluster-level services (e.g. metrics-server) as well
+as node-level agents (e.g. specific version of CRI). Focus on external or
+optional services that are needed. For example, if this feature depends on
+a cloud provider API, or upon an external software-defined storage or network
+control plane.
+
+For each of these, fill in the following—thinking about running existing user workloads
+and creating new ones, as well as about cluster-level services (e.g. DNS):
+  - [Dependency name]
+    - Usage description:
+      - Impact of its outage on the feature:
+      - Impact of its degraded performance or high-error rates on the feature:
+-->
+
+No.
+
+### Scalability
+
+<!--
+For alpha, this section is encouraged: reviewers should consider these questions
+and attempt to answer them.
+
+For beta, this section is required: reviewers must answer these questions.
+
+For GA, this section is required: approvers should be able to confirm the
+previous answers based on experience in the field.
+-->
+
+###### Will enabling / using this feature result in any new API calls?
+
+<!--
+Describe them, providing:
+  - API call type (e.g. PATCH pods)
+  - estimated throughput
+  - originating component(s) (e.g. Kubelet, Feature-X-controller)
+Focusing mostly on:
+  - components listing and/or watching resources they didn't before
+  - API calls that may be triggered by changes of some Kubernetes resources
+    (e.g. update of object X triggers new updates of object Y)
+  - periodic API calls to reconcile state (e.g. periodic fetching state,
+    heartbeats, leader election, etc.)
+-->
+
+No.
+
+###### Will enabling / using this feature result in introducing new API types?
+
+<!--
+Describe them, providing:
+  - API type
+  - Supported number of objects per cluster
+  - Supported number of objects per namespace (for namespace-scoped objects)
+-->
+
+No.
+
+###### Will enabling / using this feature result in any new calls to the cloud provider?
+
+<!--
+Describe them, providing:
+  - Which API(s):
+  - Estimated increase:
+-->
+
+No.
+
+###### Will enabling / using this feature result in increasing size or count of the existing API objects?
+
+<!--
+Describe them, providing:
+  - API type(s):
+  - Estimated increase in size: (e.g., new annotation of size 32B)
+  - Estimated amount of new objects: (e.g., new Object X for every existing Pod)
+-->
+
+The KubeletConfig API object may slightly increase in size due to new config fields.
+
+###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
+
+<!--
+Look at the [existing SLIs/SLOs].
+
+Think about adding additional work or introducing new steps in between
+(e.g. need to do X to start a container), etc. Please describe the details.
+
+[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
+-->
+
+It is possible for this feature to affect performance of some worker node-level SLIs/SLOs. We will need to monitor for differences, particularly during beta testing, when evaluating this feature for graduation.
+
+###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
+
+<!--
+Things to keep in mind include: additional in-memory state, additional
+non-trivial computations, excessive access to disks (including increased log
+volume), significant amount of data sent and/or received over network, etc.
+Think through this both in small and large cases, again with respect to the
+[supported limits].
+
+[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
+-->
+
+Yes. It will permit the utilization of swap memory (i.e. disk) on nodes. This is expected, as this enhancement is enabling cluster administrators to access this resource.
+
+### Troubleshooting
+
+<!--
+This section must be completed when targeting beta to a release.
+
+The Troubleshooting section currently serves the `Playbook` role. We may consider
+splitting it into a dedicated `Playbook` document (potentially with some monitoring
+details). For now, we leave it here.
+-->
+
+###### How does this feature react if the API server and/or etcd is unavailable?
+
+###### What are other known failure modes?
+
+<!--
+For each of them, fill in the following information by copying the below template:
+  - [Failure mode brief description]
+    - Detection: How can it be detected via metrics? Stated another way:
+      how can an operator troubleshoot without logging into a master or worker node?
+    - Mitigations: What can be done to stop the bleeding, especially for already
+      running user workloads?
+    - Diagnostics: What are the useful log messages and their required logging
+      levels that could help debug the issue?
+      Not required until feature graduated to beta.
+    - Testing: Are there any tests for failure mode? If not, describe why.
+-->
+
+###### What steps should be taken if SLOs are not being met to determine the problem?
+
+## Implementation History
+
+- **2015-04-24:** Discussed in [#7294](https://github.com/kubernetes/kubernetes/issues/7294).
+- **2017-10-06:** Discussed in [#53533](https://github.com/kubernetes/kubernetes/issues/53533).
+- **2021-01-05:** Initial design discussion document for swap support and use cases.
+- **2021-04-05:** Alpha KEP drafted for initial node-level swap support and implementation (KEP-2400).
+
+## Drawbacks
+
+<!--
+Why should this KEP _not_ be implemented?
+-->
+
+When swap is enabled, particularly for workloads, the kubelet’s resource accounting may become much less accurate. This may make cluster administration more difficult and less predictable.
+
+Currently, there exists an unsupported workaround, which is setting the kubelet flag `--fail-swap-on` to false.
+
+## Alternatives
+
+### Just set `--fail-swap-on=false`
+
+This is insufficient for most use cases because there is inconsistent control over how swap will be used by various container runtimes. Dockershim currently sets swap available for workloads to 0. The CRI does not restrict it at all. This inconsistency makes it difficult or impossible to use swap in production, particularly if a user wants to restrict workloads from using swap when using the CRI rather than dockershim.
+
+## Infrastructure Needed (Optional)
+
+<!--
+Use this section if you need things from the project/SIG. Examples include a
+new subproject, repos requested, or GitHub details. Listing these here allows a
+SIG to get the process for these resources started right away.
+-->
+
+We may need Linux VM images built with swap partitions for e2e testing in CI.
diff --git a/keps/sig-node/2400-node-swap/kep.yaml b/keps/sig-node/2400-node-swap/kep.yaml
new file mode 100644
index 00000000000..72e9ea2bebd
--- /dev/null
+++ b/keps/sig-node/2400-node-swap/kep.yaml
@@ -0,0 +1,38 @@
+title: Node system swap support
+kep-number: 2400
+authors:
+  - "@ehashman"
+owning-sig: sig-node
+participating-sigs:
+  - sig-node
+status: provisional
+creation-date: 2021-04-06
+reviewers:
+  - TBD
+approvers:
+  - "@derekwaynecarr"
+  - "@dchen1107"
+prr-approvers:
+  - "@johnbelamaric"
+
+# The target maturity stage in the current dev cycle for this KEP.
+stage: alpha
+
+# The most recent milestone for which work toward delivery of this KEP has been
+# done. This can be the current (upcoming) milestone, if it is being actively
+# worked on.
+latest-milestone: "v1.22"
+
+# The milestone at which this feature was, or is targeted to be, at each stage.
+milestone:
+  alpha: "v1.22"
+  beta: "v1.23"
+  stable: "v1.24"
+
+# The following PRR answers are required at alpha release
+# List the feature gate name and the components for which it must be enabled
+feature-gates:
+  - name: NodeSwapEnabled
+    components:
+      - kubelet
+disable-supported: false

From 3577d17d4e4b06366a01faf4ed756c28c6bae30e Mon Sep 17 00:00:00 2001
From: Ike Ma <ike.shibin.ma@gmail.com>
Date: Wed, 28 Apr 2021 17:09:07 -0700
Subject: [PATCH 2/9] Implementation details from Ike Ma

Co-Authored-By: Elana Hashman <ehashman@redhat.com>
---
 keps/sig-node/2400-node-swap/README.md | 109 +++++++++++++++++++++++--
 keps/sig-node/2400-node-swap/kep.yaml  |   1 +
 2 files changed, 103 insertions(+), 7 deletions(-)

diff --git a/keps/sig-node/2400-node-swap/README.md b/keps/sig-node/2400-node-swap/README.md
index 3386e3fa890..a3835f50cba 100644
--- a/keps/sig-node/2400-node-swap/README.md
+++ b/keps/sig-node/2400-node-swap/README.md
@@ -220,13 +220,108 @@ Since swap provisioning is out of scope of this proposal, this enhancement poses
 
 ## Design Details
 
-\[In progress\]
-
-Need to add specifics here for:
-
-- Changes to `--fail-on-swap` flag
-- CRI config details
-- Where changes will need to be made so that dockershim and the CRI are consistent with swap control
+### TL;DR
+
+In a nutshell, the following changes are planned for Memory Swap Support
+in 1.22 GKE alpha:
+
+1. Add a feature gate, `SupportNodeMemorySwap`, to guard the memory swap
+   support feature.
+2. Keep the default value of the kubelet flag `--fail-swap-on` at `true` in
+   order to minimize the blast radius.
+3. Introduce two new kubelet configuration parameters, `MemorySwapLimit` and `Swappiness`.
+4. Introduce two new CRI parameters, `memory_swap_limit_in_bytes` and `memory_swappiness`.
+5. Wire these settings end to end from the kubelet config file to the CRI.
+
+### Expected User Behaviour
+
+For alpha, the feature gate `SupportNodeMemorySwap` is disabled by default, and
+the `--fail-swap-on` flag value is the same as in 1.21. Therefore, from a
+Kubernetes user’s perspective, there are no behavior changes out of the box.
+
+Users who are ready to explore the Memory Swap feature in 1.22 Alpha will need
+to complete the following steps:
+
+1. provision swap and enable the `SupportNodeMemorySwap` feature gate, AND
+2. set the `--fail-swap-on` flag to `false`
+
+Then, the user can start experimenting with and fine-tuning the kubelet
+configuration `MemorySwapLimit` and/or `Swappiness` and observe the changes.
+
+### New Kubelet Configuration
+
+We will be introducing two new parameters to `KubeletConfiguration struct`
+defined in
+[https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/apis/config/types.go](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/apis/config/types.go).
+These two configurations, if set, will apply to every container of the Node
+where kubelet is running.
+
+|Name|Description|Default Value|Feature Gate|
+|--- |--- |--- |--- |
+|MemorySwapLimit|This parameter sets total memory limit (memory + swap). This limits the total amount of memory this container is allowed to swap to disk.|-2, which enable disable swap|SupportNodeMemorySwap|
+|MemorySwappiness|This configuration sets how aggressively the kernel will swap memory pages. By default, the host kernel can swap out a percentage of anonymous pages used by a container. Users can set value between 0 and 100, to tune this percentage.|Unset, which will use host value|SupportNodeMemorySwap|
+
+#### MemorySwapLimit details
+
+MemorySwapLimit configuration is a kubelet flag that only takes effect on a
+container that has a memory limit set, either explicitly from
+[PodSpec]([https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#requests-and-limits](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#requests-and-limits)
+) or implicitly from [Resource
+Quota]([https://kubernetes.io/docs/concepts/policy/resource-quotas/](https://kubernetes.io/docs/concepts/policy/resource-quotas/)
+).
+
+For container with memory limit set, MemorySwapLimit setting will have the
+following effects, [similar to
+docker](https://docs.docker.com/config/containers/resource_constraints/#--memory-swap-details)
+
+* If MemorySwapLimit is set to a positive integer,
+  * If the memory limit of the container is greater or equal to
+    MemorySwapLimit, then no swap is allowed, the container does not have
+    access to swap.
+  * If the memory limit of the container is less than MemorySwapLimit, then
+    MemorySwapLimit represents the total amount of memory and swap that can be
+    used. For example, for a container with memory limit set to 300m, and
+    `MemorySwapLimit` set to 1g, the container can use 300m of memory and 700m (1g
+    - 300m) swap.
+* If MemorySwapLimit is set to 0, for containers with memory limit is set, the
+  container can use as much swap as the Memory limit setting, if the host
+  container has swap memory configured. For instance, if  a container requests
+  memory="300m" and MemorySwapLimit is not set, the container can use 600m in
+  total of memory and swap.
+* If MemorySwapLimit is explicitly set to -1, the container is allowed to use
+  unlimited swap, up to the amount available on the host system.
+* If MemorySwapLimit is explicitly set to -2,  the container does not have
+  access to swap. This value effectively prevents a container from using swap.
+
+In summary, for users experimenting with this feature
+
+|MemorySwapLimit|container memory limit (explicit or implicit)|Expected Behavior|Comment|
+|--- |--- |--- |--- |
+|Any|not set|N/A|Same as docker|
+|-2|N|no swap allowed, this is the default value||
+|-1|N|unlimited swap|Same as docker|
+|0|N|container can use up to N swap (ie: 2N memory+swap)|Same as docker|
+|X where X > 0|N where N < X|container can use up to X-N swap (ie: 2N memory+swap)|Same as docker|
+|X where X > 0|N where N >= X|no swap allowed (ie: N memory only)|Same as docker|
+
+#### MemorySwappiness details
+
+* A value of 0 turns off anonymous page swapping.
+* A value of 100 sets all anonymous pages as swappable.
+* By default, if you do not set MemorySwappiness, the value is inherited from
+  the host machine.
+
+### CRI Changes
+
+We will be introducing the following two parameters
+`memory_swap_limit_in_bytes` and `memory_swappiness` to `message
+LinuxContainerResources` defined in
+[https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/cri-api/pkg/apis/runtime/v1/api.proto#L563-L580](https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/cri-api/pkg/apis/runtime/v1/api.proto#L563-L580)
+
+|Name|Type|Description|Default Value|Feature Gate|
+|--- |--- |--- |--- |--- |
+|`memory_swap_limit_in_bytes`|int64|set/show limit of memory+swap usage|Default 0, which is unspecified.|SupportNodeMemorySwap|
+|`memory_swappiness`|int64|set/show swappiness parameter|Default 0, which is unspecified.|SupportNodeMemorySwap|
 
 ### Test Plan
 
diff --git a/keps/sig-node/2400-node-swap/kep.yaml b/keps/sig-node/2400-node-swap/kep.yaml
index 72e9ea2bebd..cc8202d6130 100644
--- a/keps/sig-node/2400-node-swap/kep.yaml
+++ b/keps/sig-node/2400-node-swap/kep.yaml
@@ -2,6 +2,7 @@ title: Node system swap support
 kep-number: 2400
 authors:
   - "@ehashman"
+  - "@ike-ma"
 owning-sig: sig-node
 participating-sigs:
   - sig-node

From 27c30381c39633f49bfa20ddc5bc95c39ad2766f Mon Sep 17 00:00:00 2001
From: Elana Hashman <ehashman@redhat.com>
Date: Thu, 29 Apr 2021 16:08:56 -0700
Subject: [PATCH 3/9] Rewrite implementation details, address feedback

---
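The `MemorySwapLimit` semantics documented in this patch can be sketched as a
small Go helper. This is a minimal illustration under stated assumptions, not
kubelet code: `swapLimitBytes` is a hypothetical name, a container memory limit
of 0 is taken to mean "no limit set", and -1 is assumed to be passed through to
the runtime as "unlimited".

```go
package main

import "fmt"

// swapLimitBytes is a hypothetical helper (not kubelet code). It maps the
// node-wide MemorySwapLimit setting and a container's memory limit, both in
// bytes, to a combined memory+swap limit that could be handed to the CRI.
func swapLimitBytes(nodeLimit *int64, memLimit int64) int64 {
	if memLimit <= 0 {
		// Assumption: without a container memory limit the setting has
		// no effect, so the CRI field stays at 0 (not specified).
		return 0
	}
	switch {
	case nodeLimit == nil:
		// Not set: no swap, so memory+swap equals the memory limit.
		return memLimit
	case *nodeLimit == -1:
		// Unlimited swap, up to what the host provides.
		return -1
	case *nodeLimit == 0:
		// As much swap as the memory limit.
		return 2 * memLimit
	case *nodeLimit > memLimit:
		// Combined memory and swap capped at the node-wide value.
		return *nodeLimit
	default:
		// Memory limit meets or exceeds the cap (or the value is an
		// unrecognized negative, by assumption): no swap.
		return memLimit
	}
}

func main() {
	const mi = int64(1) << 20
	oneGi := 1024 * mi
	for _, mem := range []int64{300 * mi, 700 * mi, 1024 * mi} {
		fmt.Printf("memory limit %d -> memory+swap limit %d\n",
			mem, swapLimitBytes(&oneGi, mem))
	}
}
```

The helper returns a combined limit rather than a swap-only amount because
the proposed `memory_swap_limit_in_bytes` field is described as memory + swap.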
 keps/sig-node/2400-node-swap/README.md | 212 +++++++++++--------------
 keps/sig-node/2400-node-swap/kep.yaml  |   1 +
 2 files changed, 97 insertions(+), 116 deletions(-)

diff --git a/keps/sig-node/2400-node-swap/README.md b/keps/sig-node/2400-node-swap/README.md
index a3835f50cba..46617831b20 100644
--- a/keps/sig-node/2400-node-swap/README.md
+++ b/keps/sig-node/2400-node-swap/README.md
@@ -1,21 +1,5 @@
 # KEP-2400: Node system swap support
 
-<!--
-This is the title of your KEP. Keep it short, simple, and descriptive. A good
-title can help communicate what the KEP is and should be considered as part of
-any review.
--->
-
-<!--
-A table of contents is helpful for quickly jumping to sections of a KEP and for
-highlighting any additional information provided beyond the standard KEP
-template.
-
-Ensure the TOC is wrapped with
-  <code>&lt;!-- toc --&rt;&lt;!-- /toc --&rt;</code>
-tags, and then generate with `hack/update-toc.sh`.
--->
-
 <!-- toc -->
 - [Release Signoff Checklist](#release-signoff-checklist)
 - [Summary](#summary)
@@ -34,6 +18,10 @@ tags, and then generate with `hack/update-toc.sh`.
   - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
   - [Risks and Mitigations](#risks-and-mitigations)
 - [Design Details](#design-details)
+  - [Enabling swap as an end user](#enabling-swap-as-an-end-user)
+  - [API Changes](#api-changes)
+    - [KubeConfig addition](#kubeconfig-addition)
+  - [CRI Changes](#cri-changes)
   - [Test Plan](#test-plan)
   - [Graduation Criteria](#graduation-criteria)
     - [Alpha](#alpha)
@@ -121,20 +109,24 @@ This KEP will be limited in scope to the first two scenarios. The third can be a
 - On Linux systems, when swap is provisioned and available, Kubelet can start up with swap on.
 - Configuration is available for CRI to set swap utilization available to Kubernetes workloads, defaulting to 0 swap.
 - Cluster administrators can enable and configure CRI swap utilization on a per-node basis.
+- Use of swap memory with both cgroupsv1 and cgroupsv2 is supported.
 
 ### Non-Goals
 
 - Provisioning swap. Swap must already be available on the system.
+- Setting [swappiness]. This can already be set on a system-wide level outside of Kubernetes.
 - Allocating swap on a per-workload basis with accounting (e.g. pod-level specification of swap). If desired, this should be designed and implemented as part of a follow-up KEP. This KEP is a prerequisite for that work.
 - Supporting zram, zswap, or other memory types like SGX EPC. These could be addressed in a follow-up KEP, and are out of scope.
 
+[swappiness]: https://en.wikipedia.org/wiki/Memory_paging#Swappiness
+
 ## Proposal
 
-I propose that, when swap is provisioned and available on a node, we allow cluster administrators to configure the Kubelet and CRI such that:
+We propose that, when swap is provisioned and available on a node, cluster administrators can configure the Kubelet and CRI such that:
 
 - The kubelet can start with swap on.
 - The CRI is updated such that by default, workloads will use 0 swap.
-- The CRI will have configuration available such that swap utilization can be configured for the entire node (e.g. as a percentage of pod memory requests).
+- The CRI will have configuration available such that swap utilization can be configured for the entire node.
 
 This proposal enables scenarios 1 and 2 above, but not 3.
 
@@ -201,133 +193,121 @@ This user story is addressed by scenario 2, and could benefit from 3.
 
 ### Notes/Constraints/Caveats (Optional)
 
-<!--
-What are the caveats to the proposal?
-What are some important details that didn't come across above?
-Go in to as much detail as necessary here.
-This might be a good place to talk about core concepts and how they relate.
--->
+In changing the CRI, we must ensure that container runtime downstreams are able to support the new configurations.
 
-### Risks and Mitigations
+We considered adding parameters for both per-workload `memory-swap` and `swappiness`. These are documented as part of the Open Containers [runtime specification] for Linux memory configuration. Since `memory-swap` is a per-workload parameter, and `swappiness` is optional and can be set globally, we are choosing to only expose `memory-swap` which will adjust swap available to workloads.
 
-Having swap available on a system reduces predictability. When swap is available to workloads, and is not accounted for on an individual workload-by-workload basis
+Since we are not currently setting `memory-swap` in the CRI, the default behaviour is to allocate the same amount of swap for a workload as memory requested. We will update the default to not permit the use of swap by setting `memory-swap` equal to `limit`.
 
-First, this risk is mitigated by preventing any workloads from using swap by default, even if it is enabled on a system. This will allow a cluster administrator to test swap utilization just at the system level without introducing unpredictability to workload resource utilization.
+[runtime specification]: https://github.com/opencontainers/runtime-spec/blob/1c3f411f041711bbeecf35ff7e93461ea6789220/config-linux.md#memory
 
-Additionally, we mitigate this risk by quantifying system stability and then gathering test and production data to determine if system stability remains the same or is improved when swap is available to the system and/or workloads.
+### Risks and Mitigations
 
-Since swap provisioning is out of scope of this proposal, this enhancement poses little risk to Kubernetes clusters that will not enable swap.
+Having swap available on a system reduces predictability. Swap's performance is worse than regular memory, sometimes by many orders of magnitude, which can cause unexpected performance regressions. Furthermore, swap changes a system's behaviour under memory pressure, and applications cannot directly control what portions of their memory usage are swapped out. Since enabling swap permits greater memory usage for workloads in Kubernetes that cannot be predictably accounted for, it also increases the risk of noisy neighbours and unexpected packing configurations, as the scheduler cannot account for swap memory usage.
 
-## Design Details
+This risk is mitigated by preventing any workloads from using swap by default, even if swap is enabled and available on a system. This will allow a cluster administrator to test swap utilization just at the system level without introducing unpredictability to workload resource utilization.
 
-### TL;DR
+Additionally, we will mitigate this risk by determining a set of metrics to quantify system stability and then gathering test and production data to determine if system stability changes when swap is available to the system and/or workloads in a number of different scenarios.
 
-In a nutshell, the following implementation are planned for Memory Swap Support
-in 1.22 GKE alpha
+Since swap provisioning is out of scope of this proposal, this enhancement poses low risk to Kubernetes clusters that will not enable swap.
 
-1. Having a feature gate `SupportNodeMemorySwap` guarding against the memory
-   swap support feature
-2. Keep the default value of kubelet flag `--fail-on-swap` to `true` in order
-   to minimize the blast radius
-3. Introducing two new kubelet config `MemorySwapLimit` and `Swappiness`
-4. Introducing two new CRI parameter `memory_swap_limit_in_bytes` and `memory_swappiness`
-5. End to end wiring from kubelet config file to CRI
+## Design Details
 
-### Expected User Behaviour
+We summarize the implementation plan as follows:
 
-For alpha, the feature gate `SupportNodeMemorySwap` is default to disabled, and
-`--fail-on-swap` flag value is the same as 1.21. Therefore, from Kubernetes
-user’s perspective, no behavior changes out of the box.
+1. Add a feature gate `NodeSwapEnabled` to enable swap support.
+1. Leave the default value of the kubelet flag `--fail-on-swap` as `true`, to avoid changing default behaviour.
+1. Introduce a new kubelet config parameter, `MemorySwapLimit`.
+1. Introduce a new CRI parameter, `memory_swap_limit_in_bytes`.
+1. Integrate new kubelet config and pass values to CRI for container creation.
+1. Ensure container runtimes are updated so they can make use of the new CRI configuration.
 
-For users that are ready to explore the Memory Swap feature in 1.22 Alpha, they
-will need to complete the following steps
+### Enabling swap as an end user
 
-1. provision swap enable `SupportNodeMemorySwap` flag AND
-2. set `--fail-on-swap` flag to `false`
+Swap can be enabled as follows:
 
-Then, the user can start experimenting/fine tuning kubelet configuration
-`MemorySwapLimit` and/or `Swappiness` and observe the changes.
+1. Provision swap on the target worker nodes,
+1. Enable the `NodeSwapEnabled` feature gate on the kubelet,
+1. Set `--fail-on-swap` flag to `false`, and
+1. (Optional) Configure `MemorySwapLimit` in the KubeletConfig for tuning.
 
-### New Kubelet Configuration
+### API Changes
 
-We will be introducing two new parameters to `KubeletConfiguration struct`
-defined in
-[https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/apis/config/types.go](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/apis/config/types.go).
-These two configurations, if set, will apply to every container of the Node
-where kubelet is running.
+#### KubeConfig addition
 
-|Name|Description|Default Value|Feature Gate|
-|--- |--- |--- |--- |
-|MemorySwapLimit|This parameter sets total memory limit (memory + swap). This limits the total amount of memory this container is allowed to swap to disk.|-2, which enable disable swap|SupportNodeMemorySwap|
-|MemorySwappiness|This configuration sets how aggressively the kernel will swap memory pages. By default, the host kernel can swap out a percentage of anonymous pages used by a container. Users can set value between 0 and 100, to tune this percentage.|Unset, which will use host value|SupportNodeMemorySwap|
+We will add an optional `MemorySwapLimit` value to the `KubeletConfiguration` struct in [pkg/kubelet/apis/config/types.go], which is a backwards-compatible API change, as follows:
 
-#### MemorySwapLimit details
+[pkg/kubelet/apis/config/types.go]: https://github.com/kubernetes/kubernetes/blob/6baad0a1d45435ff5844061aebab624c89d698f8/pkg/kubelet/apis/config/types.go#L81
 
-MemorySwapLimit configuration is a kubelet flag that only takes effect on a
-container that has a memory limit set, either explicitly from
-[PodSpec]([https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#requests-and-limits](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#requests-and-limits)
-) or implicitly from [Resource
-Quota]([https://kubernetes.io/docs/concepts/policy/resource-quotas/](https://kubernetes.io/docs/concepts/policy/resource-quotas/)
-).
+```go
+// KubeletConfiguration contains the configuration for the Kubelet
+type KubeletConfiguration struct {
+	metav1.TypeMeta
+...
+	// Configure swap memory available to container workloads.
+	// If not set, workloads cannot use swap.
+	// If set to 0, workloads can use as much swap as their memory limit.
+	// If set to -1, workloads can use unlimited swap, up to the system limit.
+	// If set to a positive integer, workloads can use a total of memory and swap up to this
+	// limit. When containers request more memory than this limit, they cannot use swap.
+	// +featureGate=NodeSwapEnabled
+	// +optional
+	MemorySwapLimit *int64
+}
+```
 
 For container with memory limit set, MemorySwapLimit setting will have the
-following effects, [similar to
-docker](https://docs.docker.com/config/containers/resource_constraints/#--memory-swap-details)
-
-* If MemorySwapLimit is set to a positive integer,
-  * If the memory limit of the container is greater or equal to
-    MemorySwapLimit, then no swap is allowed, the container does not have
-    access to swap.
-  * If the memory limit of the container is less than MemorySwapLimit, then
-    MemorySwapLimit represents the total amount of memory and swap that can be
-    used. For example, for a container with memory limit set to 300m, and
-    `MemorySwapLimit` set to 1g, the container can use 300m of memory and 700m (1g
-    - 300m) swap.
-* If MemorySwapLimit is set to 0, for containers with memory limit is set, the
-  container can use as much swap as the Memory limit setting, if the host
-  container has swap memory configured. For instance, if  a container requests
-  memory="300m" and MemorySwapLimit is not set, the container can use 600m in
-  total of memory and swap.
-* If MemorySwapLimit is explicitly set to -1, the container is allowed to use
-  unlimited swap, up to the amount available on the host system.
-* If MemorySwapLimit is explicitly set to -2,  the container does not have
-  access to swap. This value effectively prevents a container from using swap.
-
-In summary, for users experimenting with this feature
-
-|MemorySwapLimit|container memory limit (explicit or implicit)|Expected Behavior|Comment|
-|--- |--- |--- |--- |
-|Any|not set|N/A|Same as docker|
-|-2|N|no swap allowed, this is the default value||
-|-1|N|unlimited swap|Same as docker|
-|0|N|container can use up to N swap (ie: 2N memory+swap)|Same as docker|
-|X where X > 0|N where N < X|container can use up to X-N swap (ie: 2N memory+swap)|Same as docker|
-|X where X > 0|N where N >= X|no swap allowed (ie: N memory only)|Same as docker|
-
-#### MemorySwappiness details
-
-* A value of 0 turns off anonymous page swapping.
-* A value of 100 sets all anonymous pages as swappable.
-* By default, if you do not set MemorySwappiness, the value is inherited from
-  the host machine.
+following effects, following the [Docker] and open container specification:
+
+* If `MemorySwapLimit` is not set, containers do not have access to swap. This
+  value effectively prevents a container from using swap, even if it is enabled
+  on a system.
+* If `MemorySwapLimit` is set to 0, for containers with memory limit is set, the
+  container can use as much swap as its memory limit setting. For instance, if
+  a container requests 300Mi memory and `MemorySwapLimit` is not set, the
+  container can use 600Mi total memory and swap.
+* If `MemorySwapLimit` is set to -1, the container is allowed to use
+  unlimited swap, up to the maximum amount available on the host system.
+* If `MemorySwapLimit` is set to a positive integer, then for containers with a
+  memory limit set, that value represents the system-wide maximum limit for
+  combined memory and swap usage of a container. For example, if
+  `MemorySwapLimit` is set to 1073742000 (1Gi):
+  * If the container's memory limit is 300Mi, it can use 1Gi combined memory
+    and swap (e.g. up to 700Mi swap).
+  * If the container's memory limit is 700Mi, it can use 1Gi combined memory
+    and swap (e.g. up to 300Mi swap).
+  * If the container's memory limit is 1Gi or greater, it cannot use swap.
+
+[docker]: https://docs.docker.com/config/containers/resource_constraints/#--memory-swap-details
 
 ### CRI Changes
 
-We will be introducing the following two parameters
-`memory_swap_limit_in_bytes` and `memory_swappiness` to `message
-LinuxContainerResources` defined in
-[https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/cri-api/pkg/apis/runtime/v1/api.proto#L563-L580](https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/cri-api/pkg/apis/runtime/v1/api.proto#L563-L580)
-
-|Name|Type|Description|Default Value|Feature Gate|
-|--- |--- |--- |--- |--- |
-|`memory_swap_limit_in_bytes`|int64|set/show limit of memory+swap usage|Default 0, which is unspecified.|SupportNodeMemorySwap|
-|`memory_swappiness`|int64|set/show swappiness parameter|Default 0, which is unspecified.|SupportNodeMemorySwap|
+The CRI requires a corresponding change in order to allow the kubelet to set swap usage in container runtimes.
+We will introduce a parameter `memory_swap_limit_in_bytes` to the CRI API (found in [k8s.io/cri-api/pkg/apis/runtime/v1/api.proto]):
+
+[k8s.io/cri-api/pkg/apis/runtime/v1/api.proto]: https://github.com/kubernetes/kubernetes/blob/6baad0a1d45435ff5844061aebab624c89d698f8/staging/src/k8s.io/cri-api/pkg/apis/runtime/v1/api.proto#L563-L580
+
+```protobuf
+// LinuxContainerResources specifies Linux specific configuration for
+// resources.
+message LinuxContainerResources {
+...
+    // Memory limit in bytes. Default: 0 (not specified).
+    int64 memory_limit_in_bytes = 4;
+    // Memory + swap limit in bytes. Default: 0 (not specified).
+    int64 memory_swap_limit_in_bytes = 9;
+...
+    // List of HugepageLimits to limit the HugeTLB usage of container per page size. Default: nil (not specified).
+    repeated HugepageLimit hugepage_limits = 8;
+}
+```
 
 ### Test Plan
 
 For alpha:
 
 - Swap scenarios are enabled in test-infra for at least two Linux distributions. e2e suites will be run against them.
+  - Container runtimes must be bumped in CI to use the new CRI.
 - Data should be gathered from a number of use cases to guide beta graduation and further development efforts.
 
 Once this data is available, additional test plans should be added for the next phase of graduation.
@@ -426,7 +406,7 @@ Pick one of these and delete the rest.
 
 - [x] Feature gate (also fill in values in `kep.yaml`)
   - Feature gate name: NodeSwapEnabled
-  - Components depending on the feature gate: Kubelet
+  - Components depending on the feature gate: API Server, Kubelet
 - [ ] Other
   - Describe the mechanism:
   - Will enabling / disabling the feature require downtime of the control
diff --git a/keps/sig-node/2400-node-swap/kep.yaml b/keps/sig-node/2400-node-swap/kep.yaml
index cc8202d6130..ad4b52d3082 100644
--- a/keps/sig-node/2400-node-swap/kep.yaml
+++ b/keps/sig-node/2400-node-swap/kep.yaml
@@ -35,5 +35,6 @@ milestone:
 feature-gates:
   - name: NodeSwapEnabled
     components:
+      - kube-apiserver
       - kubelet
 disable-supported: false

From 15eabc886fa19ab5a60abb842369df1931770a54 Mon Sep 17 00:00:00 2001
From: Elana Hashman <ehashman@redhat.com>
Date: Tue, 4 May 2021 16:22:38 -0700
Subject: [PATCH 4/9] Update KEP with API review feedback

---
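The `SwapBehavior` values introduced in this patch can be sketched as a
validation-and-resolution step in Go. This is an illustration under stated
assumptions, not the kubelet implementation: the constant names and
`resolveSwapLimit` are hypothetical, and `PerWorkloadMemorySwapLimit` is
modelled as a plain byte count instead of `resource.Quantity` to keep the
example free of dependencies.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical constants for the SwapBehavior strings named in the KEP.
const (
	BehaviorNoSwap            = "NoSwap"
	BehaviorWorkloadSpecified = "WorkloadSpecifiedSwapLimit"
	BehaviorUnlimited         = "UnlimitedSwap"
	BehaviorLimited           = "LimitedSwap"
)

// LimitedSwapConfiguration uses int64 bytes here (assumption); the KEP
// proposes resource.Quantity.
type LimitedSwapConfiguration struct {
	PerWorkloadMemorySwapLimit int64
}

type MemorySwapConfiguration struct {
	SwapBehavior string
	LimitedSwap  *LimitedSwapConfiguration
}

// resolveSwapLimit returns the combined memory+swap limit in bytes for a
// container with the given memory limit, or an error for invalid config.
// A result of -1 stands for "unlimited" (assumption).
func resolveSwapLimit(c MemorySwapConfiguration, memLimit int64) (int64, error) {
	switch c.SwapBehavior {
	case "", BehaviorNoSwap:
		return memLimit, nil
	case BehaviorWorkloadSpecified:
		return 2 * memLimit, nil
	case BehaviorUnlimited:
		return -1, nil
	case BehaviorLimited:
		if c.LimitedSwap == nil || c.LimitedSwap.PerWorkloadMemorySwapLimit <= 0 {
			return 0, errors.New("LimitedSwap requires a positive PerWorkloadMemorySwapLimit")
		}
		if limit := c.LimitedSwap.PerWorkloadMemorySwapLimit; limit > memLimit {
			return limit, nil
		}
		// Memory limit meets or exceeds the cap: no swap.
		return memLimit, nil
	default:
		return 0, fmt.Errorf("unknown SwapBehavior %q", c.SwapBehavior)
	}
}

func main() {
	cfg := MemorySwapConfiguration{
		SwapBehavior: BehaviorLimited,
		LimitedSwap:  &LimitedSwapConfiguration{PerWorkloadMemorySwapLimit: 1 << 30},
	}
	limit, err := resolveSwapLimit(cfg, 300<<20)
	fmt.Println(limit, err)
}
```

Rejecting `"LimitedSwap"` without a `LimitedSwap` block at validation time
follows the requirement in the text that the configuration "must also be set".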
 keps/sig-node/2400-node-swap/README.md | 61 ++++++++++++++++----------
 1 file changed, 38 insertions(+), 23 deletions(-)

diff --git a/keps/sig-node/2400-node-swap/README.md b/keps/sig-node/2400-node-swap/README.md
index 46617831b20..76a94b806b1 100644
--- a/keps/sig-node/2400-node-swap/README.md
+++ b/keps/sig-node/2400-node-swap/README.md
@@ -97,7 +97,7 @@ There are hence a number of possible ways that one could envision swap use on a
 
 ### Scenarios
 
-1. Swap is enabled only at the system level. The CRI does not permit user workloads to use swap. (This scenario is a prerequisite for the following use cases.)
+1. Swap is enabled on a node's host system, but the CRI does not permit Kubernetes workloads to use swap. (This scenario is a prerequisite for the following use cases.)
 1. Swap is enabled at the node level. The CRI can be globally configured to permit user workloads scheduled on the node to use some quantity of swap.
 1. Swap is set on a per-workload basis. The CRI sets permitted swap utilization on each individual workload.
 
@@ -245,33 +245,46 @@ type KubeletConfiguration struct {
 	metav1.TypeMeta
 ...
 	// Configure swap memory available to container workloads.
-	// If not set, workloads cannot use swap.
-	// If set to 0, workloads can use as much swap as their memory limit.
-	// If set to -1, workloads can use unlimited swap, up to the system limit.
-	// If set to a positive integer, workloads can use a total of memory and swap up to this
-	// limit. When containers request more memory than this limit, they cannot use swap.
 	// +featureGate=NodeSwapEnabled
 	// +optional
-	MemorySwapLimit *int64
+	MemorySwap MemorySwapConfiguration
+}
+
+type MemorySwapConfiguration struct {
+	// Configure swap memory available to container workloads. May be one of
+	// "", "NoSwap": workloads cannot use swap
+	// "WorkloadSpecifiedSwapLimit": workloads can use as much swap as their memory limit.
+	// "UnlimitedSwap": workloads can use unlimited swap, up to the system limit.
+	// "LimitedSwap": workloads can use a total of memory and swap up to the limit set
+	// in LimitedSwap. When containers request more memory than that limit, they cannot use swap.
+	SwapBehavior string
+
+	LimitedSwap *LimitedSwapConfiguration
+}
+
+type LimitedSwapConfiguration struct {
+	PerWorkloadMemorySwapLimit resource.Quantity
 }
 ```
 
-For container with memory limit set, MemorySwapLimit setting will have the
-following effects, following the [Docker] and open container specification:
-
-* If `MemorySwapLimit` is not set, containers do not have access to swap. This
-  value effectively prevents a container from using swap, even if it is enabled
-  on a system.
-* If `MemorySwapLimit` is set to 0, for containers with memory limit is set, the
-  container can use as much swap as its memory limit setting. For instance, if
-  a container requests 300Mi memory and `MemorySwapLimit` is not set, the
-  container can use 600Mi total memory and swap.
-* If `MemorySwapLimit` is set to -1, the container is allowed to use
-  unlimited swap, up to the maximum amount available on the host system.
-* If `MemorySwapLimit` is set to a positive integer, then for containers with a
-  memory limit set, that value represents the system-wide maximum limit for
-  combined memory and swap usage of a container. For example, if
-  `MemorySwapLimit` is set to 1073742000 (1Gi):
+The `MemorySwapConfiguration.SwapBehavior` setting will have the following
+effects, based on the [Docker] and open container specification for the
+`--memory-swap` flag:
+
+* If `SwapBehavior` is not set or set to `"NoSwap"`, containers do not have
+  access to swap. This value effectively prevents a container from using swap,
+  even if it is enabled on a system.
+* If `SwapBehavior` is set to `"WorkloadSpecifiedSwapLimit"`, then for
+  containers with a memory limit set, the container can use as much swap as
+  its memory limit. For instance, if a container has a 300Mi memory limit
+  and `SwapBehavior` is `"WorkloadSpecifiedSwapLimit"`, the container can use
+  600Mi total memory and swap.
+* If `SwapBehavior` is set to `"UnlimitedSwap"`, the container is allowed to
+  use unlimited swap, up to the maximum amount available on the host system.
+* If `SwapBehavior` is set to `"LimitedSwap"`, then the `LimitedSwap`
+  configuration must also be set. `LimitedSwap.PerWorkloadMemorySwapLimit`
+  represents the system-wide maximum limit for combined memory and swap usage
+  of a container. For example, if the limit is set to `1Gi`:
   * If the container's memory limit is 300Mi, it can use 1Gi combined memory
     and swap (e.g. up to 700Mi swap).
   * If the container's memory limit is 700Mi, it can use 1Gi combined memory
@@ -309,6 +322,7 @@ For alpha:
 - Swap scenarios are enabled in test-infra for at least two Linux distributions. e2e suites will be run against them.
   - Container runtimes must be bumped in CI to use the new CRI.
 - Data should be gathered from a number of use cases to guide beta graduation and further development efforts.
+  - Focus should be on supported user stories as listed above.
 
 Once this data is available, additional test plans should be added for the next phase of graduation.
 
@@ -331,6 +345,7 @@ Once this data is available, additional test plans should be added for the next
 
 #### GA
 
+- Test a wide variety of scenarios that may be affected by swap support, such as workloads using tmpfs storage.
 - Remove feature flag.
 
 ### Upgrade / Downgrade Strategy

From 7b58bd2b3e74485f274ba5ca24d82601bbf7578d Mon Sep 17 00:00:00 2001
From: Elana Hashman <ehashman@redhat.com>
Date: Wed, 5 May 2021 08:16:48 -0700
Subject: [PATCH 5/9] Update metadata and add PRR

---
 keps/prod-readiness/sig-node/2400.yaml | 3 +++
 keps/sig-node/2400-node-swap/kep.yaml  | 9 ++++++---
 2 files changed, 9 insertions(+), 3 deletions(-)
 create mode 100644 keps/prod-readiness/sig-node/2400.yaml

diff --git a/keps/prod-readiness/sig-node/2400.yaml b/keps/prod-readiness/sig-node/2400.yaml
new file mode 100644
index 00000000000..1eb33a410c5
--- /dev/null
+++ b/keps/prod-readiness/sig-node/2400.yaml
@@ -0,0 +1,3 @@
+kep-number: 2400
+alpha:
+  approver: "@deads2k"
diff --git a/keps/sig-node/2400-node-swap/kep.yaml b/keps/sig-node/2400-node-swap/kep.yaml
index ad4b52d3082..b514f563426 100644
--- a/keps/sig-node/2400-node-swap/kep.yaml
+++ b/keps/sig-node/2400-node-swap/kep.yaml
@@ -6,15 +6,18 @@ authors:
 owning-sig: sig-node
 participating-sigs:
   - sig-node
-status: provisional
+status: implementable
 creation-date: 2021-04-06
 reviewers:
-  - TBD
+  - "@SergeyKanzhelev"
+  - "@anguslees"
+  - "@deads2k"
+  - "@sftim"
 approvers:
   - "@derekwaynecarr"
   - "@dchen1107"
 prr-approvers:
-  - "@johnbelamaric"
+  - "@deads2k"
 
 # The target maturity stage in the current dev cycle for this KEP.
 stage: alpha

From 84adca77276a05ee5abab3aab48dceb525274381 Mon Sep 17 00:00:00 2001
From: Elana Hashman <ehashman@redhat.com>
Date: Fri, 7 May 2021 14:22:25 -0700
Subject: [PATCH 6/9] Reflow to 80-char line width

---
 keps/sig-node/2400-node-swap/README.md | 237 ++++++++++++++++++-------
 1 file changed, 173 insertions(+), 64 deletions(-)

diff --git a/keps/sig-node/2400-node-swap/README.md b/keps/sig-node/2400-node-swap/README.md
index 76a94b806b1..14355f61192 100644
--- a/keps/sig-node/2400-node-swap/README.md
+++ b/keps/sig-node/2400-node-swap/README.md
@@ -83,50 +83,76 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
 
 ## Summary
 
-Kubernetes currently does not support the use of [swap memory](https://en.wikipedia.org/wiki/Paging#Linux) on Linux, as it is difficult to provide guarantees and account for pod memory utilization when swap is involved. As part of Kubernetes’ earlier design, [swap support was considered out of scope](https://github.com/kubernetes/kubernetes/issues/7294).
-
-However, there are a [number of use cases](#user-stories) that would benefit from Kubernetes nodes supporting swap. Hence, this proposal aims to add swap support to nodes in a controlled, predictable manner so that Kubernetes users can perform testing and provide data to continue building cluster capabilities on top of swap.
+Kubernetes currently does not support the use of [swap
+memory](https://en.wikipedia.org/wiki/Paging#Linux) on Linux, as it is
+difficult to provide guarantees and account for pod memory utilization when
+swap is involved. As part of Kubernetes’ earlier design, [swap support was
+considered out of scope](https://github.com/kubernetes/kubernetes/issues/7294).
+
+However, there are a [number of use cases](#user-stories) that would benefit
+from Kubernetes nodes supporting swap. Hence, this proposal aims to add swap
+support to nodes in a controlled, predictable manner so that Kubernetes users
+can perform testing and provide data to continue building cluster capabilities
+on top of swap.
 
 ## Motivation
 
 There are two distinct types of user for swap, who may overlap:
-- node administrators, who may want swap available for node-level performance tuning and stability/reducing noisy neighbour issues
-- application developers, who have written applications that would benefit from using swap memory
+- node administrators, who may want swap available for node-level performance
+  tuning and stability/reducing noisy neighbour issues
+- application developers, who have written applications that would benefit from
+  using swap memory
 
-There are hence a number of possible ways that one could envision swap use on a node.
+There are hence a number of possible ways that one could envision swap use on a
+node.
 
 ### Scenarios
 
-1. Swap is enabled on a node's host system, but the CRI does not permit Kubernetes workloads to use swap. (This scenario is a prerequisite for the following use cases.)
-1. Swap is enabled at the node level. The CRI can be globally configured to permit user workloads scheduled on the node to use some quantity of swap.
-1. Swap is set on a per-workload basis. The CRI sets permitted swap utilization on each individual workload.
+1. Swap is enabled on a node's host system, but the CRI does not permit
+   Kubernetes workloads to use swap. (This scenario is a prerequisite for the
+   following use cases.)
+1. Swap is enabled at the node level. The CRI can be globally configured to
+   permit user workloads scheduled on the node to use some quantity of swap.
+1. Swap is set on a per-workload basis. The CRI sets permitted swap utilization
+   on each individual workload.
 
-This KEP will be limited in scope to the first two scenarios. The third can be addressed in a follow-up KEP. The enablement work that is in scope for this KEP will be necessary to implement the third scenario.
+This KEP will be limited in scope to the first two scenarios. The third can be
+addressed in a follow-up KEP. The enablement work that is in scope for this KEP
+will be necessary to implement the third scenario.
 
 
 ### Goals
 
-- On Linux systems, when swap is provisioned and available, Kubelet can start up with swap on.
-- Configuration is available for CRI to set swap utilization available to Kubernetes workloads, defaulting to 0 swap.
-- Cluster administrators can enable and configure CRI swap utilization on a per-node basis.
+- On Linux systems, when swap is provisioned and available, Kubelet can start
+  up with swap on.
+- Configuration is available for CRI to set swap utilization available to
+  Kubernetes workloads, defaulting to 0 swap.
+- Cluster administrators can enable and configure CRI swap utilization on a
+  per-node basis.
 - Use of swap memory with both cgroupsv1 and cgroupsv2 is supported.
 
 ### Non-Goals
 
 - Provisioning swap. Swap must already be available on the system.
-- Setting [swappiness]. This can already be set on a system-wide level outside of Kubernetes.
-- Allocating swap on a per-workload basis with accounting (e.g. pod-level specification of swap). If desired, this should be designed and implemented as part of a follow-up KEP. This KEP is a prerequisite for that work.
-- Supporting zram, zswap, or other memory types like SGX EPC. These could be addressed in a follow-up KEP, and are out of scope.
+- Setting [swappiness]. This can already be set on a system-wide level outside
+  of Kubernetes.
+- Allocating swap on a per-workload basis with accounting (e.g. pod-level
+  specification of swap). If desired, this should be designed and implemented
+  as part of a follow-up KEP. This KEP is a prerequisite for that work.
+- Supporting zram, zswap, or other memory types like SGX EPC. These could be
+  addressed in a follow-up KEP, and are out of scope.
 
 [swappiness]: https://en.wikipedia.org/wiki/Memory_paging#Swappiness
 
 ## Proposal
 
-We propose that, when swap is provisioned and available on a node, cluster administrators can configure the Kubelet and CRI such that:
+We propose that, when swap is provisioned and available on a node, cluster
+administrators can configure the Kubelet and CRI such that:
 
 - The kubelet can start with swap on.
 - The CRI is updated such that by default, workloads will use 0 swap.
-- The CRI will have configuration available such that swap utilization can be configured for the entire node.
+- The CRI will have configuration available such that swap utilization can be
+  configured for the entire node.
 
 This proposal enables scenarios 1 and 2 above, but not 3.
 
@@ -134,7 +160,9 @@ This proposal enables scenarios 1 and 2 above, but not 3.
 
 #### Improved Node Stability
 
-cgroupsv2 improved memory management algos, such as oomd, currently require swap. Hence, having a small amount of swap available on nodes could improve better resource pressure handling and recovery.
+cgroupsv2 improved memory management algos, such as oomd, currently require
+swap. Hence, having a small amount of swap available on nodes could improve
+better resource pressure handling and recovery.
 
 - https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#id1
 - https://chrisdown.name/2018/01/02/in-defence-of-swap.html
@@ -145,28 +173,46 @@ This user story is addressed by scenario 1 and 2, and could benefit from 3.
 
 #### Long-running applications that swap out startup memory
 
-- Applications such as the Java and Node runtimes rely on swap for optimal performance https://github.com/kubernetes/kubernetes/issues/53533#issue-263475425
-- Initialization logic of applications can be safely swapped out without affecting long-running application resource usage https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-615967154
+- Applications such as the Java and Node runtimes rely on swap for optimal
+  performance
+  https://github.com/kubernetes/kubernetes/issues/53533#issue-263475425
+- Initialization logic of applications can be safely swapped out without
+  affecting long-running application resource usage
+  https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-615967154
 
 This user story is addressed by scenario 2, and could benefit from 3.
 
 #### Memory Flexibility
 
-This user story addresses cases in which cost of additional memory is prohibitive, or elastic scaling is impossible (e.g. on-premise/bare metal deployments).
-
-- Occasional cron job with high memory usage and lack of swap support means cloud nodes must always be allocated for maximum possible memory utilization, leading to overprovisioning/high costs https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-354832960
-- Lack of swap support would require provisioning 3x the amount of memory as required with swap https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-617654228
-- On-premise deployment can’t horizontally scale available memory based on load https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-637715138
-- Scaling resources is technically feasible but cost-prohibitive, swap provides flexibility at lower cost https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-553713502
+This user story addresses cases in which the cost of additional memory is
+prohibitive, or elastic scaling is impossible (e.g. on-premise/bare metal
+deployments).
+
+- Occasional cron job with high memory usage and lack of swap support means
+  cloud nodes must always be allocated for maximum possible memory utilization,
+  leading to overprovisioning/high costs
+  https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-354832960
+- Lack of swap support would require provisioning 3x the amount of memory as
+  required with swap
+  https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-617654228
+- On-premise deployment can’t horizontally scale available memory based on load
+  https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-637715138
+- Scaling resources is technically feasible but cost-prohibitive, swap provides
+  flexibility at lower cost
+  https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-553713502
 
 This user story is addressed by scenario 2, and could benefit from 3.
 
 #### Local development and systems with fast storage
 
-Local development or single-node clusters and systems with fast storage may benefit from using available swap (e.g. NVMe swap partitions, one-node clusters).
+Local development or single-node clusters and systems with fast storage may
+benefit from using available swap (e.g. NVMe swap partitions, one-node
+clusters).
 
-- Single node, local Kubernetes deployment on laptop https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-361748518
-- Linux has optimizations for swap on SSD, allowing for performance boosts https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-589275277
+- Single node, local Kubernetes deployment on laptop
+  https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-361748518
+- Linux has optimizations for swap on SSD, allowing for performance boosts
+  https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-589275277
 
 This user story is addressed by scenarios 1 and 2, and could benefit from 3.
 
@@ -174,49 +220,82 @@ This user story is addressed by scenarios 1 and 2, and could benefit from 3.
 
 For example, edge devices with limited memory.
 
-- Edge compute systems/devices with small memory footprints (\<2Gi) https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-751398086
-- Clusters with nodes \<4Gi memory https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-751404417
+- Edge compute systems/devices with small memory footprints (\<2Gi)
+  https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-751398086
+- Clusters with nodes \<4Gi memory
+  https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-751404417
 
 This user story is addressed by scenario 2, and could benefit from 3.
 
 #### Virtualization management overhead
 
-This would apply to virtualized Kubernetes workloads such as VMs launched by kubevirt.
+This would apply to virtualized Kubernetes workloads such as VMs launched by
+kubevirt.
 
-Every VM comes with a management related overhead which can sporadically be pretty significant (memory streaming, SRIOV attachment, gpu attachment, virtio-fs, …). Swap helps to not request much more memory to deal with short term worst-case scenarios.
+Every VM comes with a management-related overhead which can sporadically be
+pretty significant (memory streaming, SRIOV attachment, gpu attachment,
+virtio-fs, …). Swap avoids requesting much more memory just to cover
+short-term worst-case scenarios.
 
-With virtualization, clusters are typically provisioned based on the workloads’ memory consumption, and any infrastructure container overhead is overcommitted. This overhead could be safely swapped out.
+With virtualization, clusters are typically provisioned based on the workloads’
+memory consumption, and any infrastructure container overhead is overcommitted.
+This overhead could be safely swapped out.
 
-- Required for live migration of VMs https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-754878431
+- Required for live migration of VMs
+  https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-754878431
 
 This user story is addressed by scenario 2, and could benefit from 3.
 
 ### Notes/Constraints/Caveats (Optional)
 
-In changing the CRI, we must ensure that container runtime downstreams are able to support the new configurations.
+In changing the CRI, we must ensure that container runtime downstreams are able
+to support the new configurations.
 
-We considered adding parameters for both per-workload `memory-swap` and `swappiness`. These are documented as part of the Open Containers [runtime specification] for Linux memory configuration. Since `memory-swap` is a per-workload parameter, and `swappiness` is optional and can be set globally, we are choosing to only expose `memory-swap` which will adjust swap available to workloads.
+We considered adding parameters for both per-workload `memory-swap` and
+`swappiness`. These are documented as part of the Open Containers [runtime
+specification] for Linux memory configuration. Since `memory-swap` is a
+per-workload parameter, and `swappiness` is optional and can be set globally,
+we are choosing to only expose `memory-swap` which will adjust swap available
+to workloads.
 
-Since we are not currently setting `memory-swap` in the CRI, the default behaviour is to allocate the same amount of swap for a workload as memory requested. We will update the default to not permit the use of swap by setting `memory-swap` equal to `limit`.
+Since we are not currently setting `memory-swap` in the CRI, the default
+behaviour is to allocate the same amount of swap for a workload as memory
+requested. We will update the default to not permit the use of swap by setting
+`memory-swap` equal to `limit`.
 
 [runtime specification]: https://github.com/opencontainers/runtime-spec/blob/1c3f411f041711bbeecf35ff7e93461ea6789220/config-linux.md#memory
 
 ### Risks and Mitigations
 
-Having swap available on a system reduces predictability. Swap's performance is worse than regular memory, sometimes by many orders of magnitude, which can cause unexpected performance regressions. Furthermore, swap changes a system's behaviour under memory pressure, and applications cannot directly control what portions of their memory usage are swapped out. Since enabling swap permits greater memory usage for workloads in Kubernetes that cannot be predictably accounted for, it also increases the risk of noisy neighbours and unexpected packing configurations, as the scheduler cannot account for swap memory usage.
+Having swap available on a system reduces predictability. Swap's performance is
+worse than regular memory, sometimes by many orders of magnitude, which can
+cause unexpected performance regressions. Furthermore, swap changes a system's
+behaviour under memory pressure, and applications cannot directly control what
+portions of their memory usage are swapped out. Since enabling swap permits
+greater memory usage for workloads in Kubernetes that cannot be predictably
+accounted for, it also increases the risk of noisy neighbours and unexpected
+packing configurations, as the scheduler cannot account for swap memory usage.
 
-This risk is mitigated by preventing any workloads from using swap by default, even if swap is enabled and available on a system. This will allow a cluster administrator to test swap utilization just at the system level without introducing unpredictability to workload resource utilization.
+This risk is mitigated by preventing any workloads from using swap by default,
+even if swap is enabled and available on a system. This will allow a cluster
+administrator to test swap utilization just at the system level without
+introducing unpredictability to workload resource utilization.
 
-Additionally, we will mitigate this risk by determining a set of metrics to quantify system stability and then gathering test and production data to determine if system stability changes when swap is available to the system and/or workloads in a number of different scenarios.
+Additionally, we will mitigate this risk by determining a set of metrics to
+quantify system stability and then gathering test and production data to
+determine if system stability changes when swap is available to the system
+and/or workloads in a number of different scenarios.
 
-Since swap provisioning is out of scope of this proposal, this enhancement poses low risk to Kubernetes clusters that will not enable swap.
+Since swap provisioning is out of scope of this proposal, this enhancement
+poses low risk to Kubernetes clusters that will not enable swap.
 
 ## Design Details
 
 We summarize the implementation plan as following:
 
 1. Add a feature gate `NodeSwapEnabled` to enable swap support.
-1. Leave the default value of kubelet flag `--fail-on-swap` to `true`, to avoid changing default behaviour.
+1. Leave the default value of kubelet flag `--fail-swap-on` at `true`, to avoid
+   changing default behaviour.
 1. Introduce a new kubelet config parameter, `MemorySwapLimit`.
 1. Introduce a new CRI parameter, `memory_swap_limit_in_bytes`.
 1. Integrate new kubelet config and pass values to CRI for container creation.
@@ -235,7 +314,8 @@ Swap can be enabled as follows:
 
 #### KubeConfig addition
 
-We will add an optional `MemorySwapLimit` value to the `KubeletConfig` struct in [pkg/kubelet/apis/config/types.go] for a compatible API change as follows:
+We will add an optional `MemorySwapLimit` value to the `KubeletConfig` struct
+in [pkg/kubelet/apis/config/types.go] for a compatible API change as follows:
 
 [pkg/kubelet/apis/config/types.go]: https://github.com/kubernetes/kubernetes/blob/6baad0a1d45435ff5844061aebab624c89d698f8/pkg/kubelet/apis/config/types.go#L81
 
@@ -295,8 +375,10 @@ effects, based on the [Docker] and open container specification for the
 
 ### CRI Changes
 
-The CRI requires a corresponding change in order to allow the kubelet to set swap usage in container runtimes.
-We will introduce a parameter `memory_swap_limit_in_bytes` to the CRI API (found in [k8s.io/cri-api/pkg/apis/runtime/v1/api.proto]):
+The CRI requires a corresponding change in order to allow the kubelet to set
+swap usage in container runtimes.  We will introduce a parameter
+`memory_swap_limit_in_bytes` to the CRI API (found in
+[k8s.io/cri-api/pkg/apis/runtime/v1/api.proto]):
 
 [k8s.io/cri-api/pkg/apis/runtime/v1/api.proto]: https://github.com/kubernetes/kubernetes/blob/6baad0a1d45435ff5844061aebab624c89d698f8/staging/src/k8s.io/cri-api/pkg/apis/runtime/v1/api.proto#L563-L580
 
@@ -319,19 +401,23 @@ message LinuxContainerResources {
 
 For alpha:
 
-- Swap scenarios are enabled in test-infra for at least two Linux distributions. e2e suites will be run against them.
+- Swap scenarios are enabled in test-infra for at least two Linux
+  distributions. e2e suites will be run against them.
   - Container runtimes must be bumped in CI to use the new CRI.
-- Data should be gathered from a number of use cases to guide beta graduation and further development efforts.
+- Data should be gathered from a number of use cases to guide beta graduation
+  and further development efforts.
   - Focus should be on supported user stories as listed above.
 
-Once this data is available, additional test plans should be added for the next phase of graduation.
+Once this data is available, additional test plans should be added for the next
+phase of graduation.
 
 ### Graduation Criteria
 
 #### Alpha
 
 - Kubelet can be started with swap enabled.
-- KubeletConfig allows CRI to be configured with a percentage of swap available to workloads. This will default to 0.
+- KubeletConfig allows CRI to be configured with a percentage of swap available
+  to workloads. This will default to 0.
 - e2e test jobs are configured for Linux systems with swap enabled.
 
 
@@ -339,13 +425,15 @@ Once this data is available, additional test plans should be added for the next
 
 (Tentative.)
 
-- Determine a set of metrics for node QoS in order to evaluate the performance of nodes with and without swap enabled.
+- Determine a set of metrics for node QoS in order to evaluate the performance
+  of nodes with and without swap enabled.
 - Collect feedback from test user cases.
 - Improve coverage for appropriate scenarios in testgrid.
 
 #### GA
 
-- Test a wide variety of scenarios that may be affected by swap support, such as workloads using tmpfs storage.
+- Test a wide variety of scenarios that may be affected by swap support, such
+  as workloads using tmpfs storage.
 - Remove feature flag.
 
 ### Upgrade / Downgrade Strategy
@@ -364,7 +452,8 @@ enhancement:
 
 No changes are required on upgrade to maintain previous behaviour.
 
-It is possible to downgrade a kubelet on a node that was using swap, but this would require disabling the use of swap and setting `swapoff` on the node.
+It is possible to downgrade a kubelet on a node that was using swap, but this
+would require disabling the use of swap and running `swapoff` on the node.
 
 ### Version Skew Strategy
 
@@ -436,9 +525,12 @@ Any change of default behavior may be surprising to users or break existing
 automations, so be extremely careful here.
 -->
 
-No. If the feature flag is enabled, the user must still set `--fail-swap-on=false` to adjust the default behaviour.
+No. If the feature flag is enabled, the user must still set
+`--fail-swap-on=false` to adjust the default behaviour.
 
-A node must have swap provisioned and available for this feature to work. If there is no swap available, but the feature flag is set to true, there will still be no change in existing behaviour.
+A node must have swap provisioned and available for this feature to work. If
+there is no swap available, but the feature flag is set to true, there will
+still be no change in existing behaviour.
 
 ###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
 
@@ -449,7 +541,10 @@ feature, can it break the existing applications?).
 NOTE: Also set `disable-supported` to `true` or `false` in `kep.yaml`.
 -->
 
-No. I don’t think it makes much sense to be able to provide users a meaningful ability to disable the feature flag at runtime, as this would be highly disruptive to workloads and difficult to implement. To turn this off, the kubelet would need to be restarted.
+No. I don’t think it makes much sense to be able to provide users a meaningful
+ability to disable the feature flag at runtime, as this would be highly
+disruptive to workloads and difficult to implement. To turn this off, the
+kubelet would need to be restarted.
 
 ###### What happens if we reenable the feature if it was previously rolled back?
 
@@ -464,7 +559,8 @@ with and without the feature, are necessary. At the very least, think about
 conversion tests if API types are being modified.
 -->
 
-N/A. This should be tested separately for scenarios with the flag enabled and disabled.
+N/A. This should be tested separately for scenarios with the flag enabled and
+disabled.
 
 ### Rollout, Upgrade and Rollback Planning
 
@@ -630,7 +726,8 @@ Describe them, providing:
   - Estimated amount of new objects: (e.g., new Object X for every existing Pod)
 -->
 
-The KubeletConfig API object may slightly increase in size due to new config fields.
+The KubeletConfig API object may slightly increase in size due to new config
+fields.
 
 ###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
 
@@ -643,7 +740,9 @@ Think about adding additional work or introducing new steps in between
 [existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
 -->
 
-It is possible for this feature to affect performance of some worker node-level SLIs/SLOs. We will need to monitor for differences, particularly during beta testing, when evaluating this feature for beta and graduation.
+It is possible for this feature to affect performance of some worker node-level
+SLIs/SLOs. We will need to monitor for differences, particularly during beta
+testing, when evaluating this feature for beta and graduation.
 
 ###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
 
@@ -657,7 +756,9 @@ This through this both in small and large cases, again with respect to the
 [supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
 -->
 
-Yes. It will permit the utilization of swap memory (i.e. disk) on nodes. This is expected, as this enhancement is enabling cluster administrators to access this resource.
+Yes. It will permit the utilization of swap memory (i.e. disk) on nodes. This
+is expected, as this enhancement is enabling cluster administrators to access
+this resource.
 
 ### Troubleshooting
 
@@ -701,15 +802,23 @@ For each of them, fill in the following information by copying the below templat
 Why should this KEP _not_ be implemented?
 -->
 
-When swap is enabled, particularly for workloads, the kubelet’s resource accounting may become much less accurate. This may make cluster administration more difficult and less predictable.
+When swap is enabled, particularly for workloads, the kubelet’s resource
+accounting may become much less accurate. This may make cluster administration
+more difficult and less predictable.
 
-Currently, there exists an unsupported workaround, which is setting the kubelet flag `--fail-swap-on` to false.
+Currently, there exists an unsupported workaround, which is setting the kubelet
+flag `--fail-swap-on` to false.
 
 ## Alternatives
 
 ### Just set `--fail-swap-on=false`
 
-This is insufficient for most use cases because there is inconsistent control over how swap will be used by various container runtimes. Dockershim currently sets swap available for workloads to 0. The CRI does not restrict it at all. This inconsistency makes it difficult or impossible to use swap in production, particularly if a user wants to restrict workloads from using swap when using the CRI rather than dockershim.
+This is insufficient for most use cases because there is inconsistent control
+over how swap will be used by various container runtimes. Dockershim currently
+sets swap available for workloads to 0. The CRI does not restrict it at all.
+This inconsistency makes it difficult or impossible to use swap in production,
+particularly if a user wants to restrict workloads from using swap when using
+the CRI rather than dockershim.
 
 ## Infrastructure Needed (Optional)
 

From 64e639aabb512ee6feffc2f481ba394b76bad7de Mon Sep 17 00:00:00 2001
From: Elana Hashman <ehashman@redhat.com>
Date: Fri, 7 May 2021 14:55:11 -0700
Subject: [PATCH 7/9] Address PRR and other review comments

---
 keps/sig-node/2400-node-swap/README.md | 43 ++++++++++++++------------
 1 file changed, 24 insertions(+), 19 deletions(-)

diff --git a/keps/sig-node/2400-node-swap/README.md b/keps/sig-node/2400-node-swap/README.md
index 14355f61192..438743eb023 100644
--- a/keps/sig-node/2400-node-swap/README.md
+++ b/keps/sig-node/2400-node-swap/README.md
@@ -164,6 +164,7 @@ cgroupsv2 improved memory management algos, such as oomd, currently require
 swap. Hence, having a small amount of swap available on nodes could improve
 better resource pressure handling and recovery.
 
+- https://man7.org/linux/man-pages/man8/systemd-oomd.service.8.html
 - https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#id1
 - https://chrisdown.name/2018/01/02/in-defence-of-swap.html
 - https://media.ccc.de/v/ASG2018-175-oomd
@@ -258,10 +259,10 @@ per-workload parameter, and `swappiness` is optional and can be set globally,
 we are choosing to only expose `memory-swap` which will adjust swap available
 to workloads.
 
-Since we are not currently setting `memory-swap` in the CRI, the default
-behaviour is to allocate the same amount of swap for a workload as memory
-requested. We will update the default to not permit the use of swap by setting
-`memory-swap` equal to `limit`.
+Since we are not currently setting `memory-swap` in the CRI, the current
+default behaviour when `--fail-swap-on=false` is set is to allocate the same
+amount of swap for a workload as memory requested. We will update the default
+to not permit the use of swap by setting `memory-swap` equal to `limit`.
 
 [runtime specification]: https://github.com/opencontainers/runtime-spec/blob/1c3f411f041711bbeecf35ff7e93461ea6789220/config-linux.md#memory
 
@@ -347,9 +348,9 @@ type LimitedSwapConfiguration struct {
 }
 ```
 
-The `MemorySwapConfiguration.SwapBehavior` setting will have the following
-effects, based on the [Docker] and open container specification for the
-`--memory-swap` flag:
+We want to expose all possible swap settings based on the [Docker] and open
+container specification for the `--memory-swap` flag. Thus, the
+`MemorySwapConfiguration.SwapBehavior` setting will have the following effects:
 
 * If `SwapBehavior` is not set or set to `"NoSwap"`, containers do not have
   access to swap. This value effectively prevents a container from using swap,
@@ -361,7 +362,7 @@ effects, based on the [Docker] and open container specification for the
   and swap.
 * If `SwapBehavior` is set to `"UnlimitedSwap"`, the container is allowed to
   use unlimited swap, up to the maximum amount available on the host system.
-* If `SwapBehavior` is set to a `"LimitedSwap"`, then the `LimitedSwap`
+* If `SwapBehavior` is set to `"LimitedSwap"`, then the `LimitedSwap`
   configuration must also be set. `LimitedSwap.PerWorkloadMemorySwapLimit`
   represents the system-wide maximum limit for combined memory and swap usage
   of a container. For example, if the limit is set to `1Gi`:
@@ -387,13 +388,9 @@ swap usage in container runtimes.  We will introduce a parameter
 // resources.
 message LinuxContainerResources {
 ...
-    // Memory limit in bytes. Default: 0 (not specified).
-    int64 memory_limit_in_bytes = 4;
     // Memory + swap limit in bytes. Default: 0 (not specified).
     int64 memory_swap_limit_in_bytes = 9;
 ...
-    // List of HugepageLimits to limit the HugeTLB usage of container per page size. Default: nil (not specified).
-    repeated HugepageLimit hugepage_limits = 8;
 }
 ```
 
@@ -511,12 +508,16 @@ Pick one of these and delete the rest.
 - [x] Feature gate (also fill in values in `kep.yaml`)
   - Feature gate name: NodeSwapEnabled
   - Components depending on the feature gate: API Server, Kubelet
-- [ ] Other
-  - Describe the mechanism:
+- [x] Other
+  - Describe the mechanism: `--fail-swap-on=false` flag for kubelet must also
+    be set at kubelet start
   - Will enabling / disabling the feature require downtime of the control
-    plane?
+    plane? Yes. Flag must be set on kubelet start. To disable, kubelet must be
+    restarted. Hence, there would be brief control component downtime on a
+    given node.
   - Will enabling / disabling the feature require downtime or reprovisioning
     of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled).
+    Yes. See above; disabling would require brief node downtime.
 
 ###### Does enabling the feature change any default behavior?
 
@@ -541,10 +542,14 @@ feature, can it break the existing applications?).
 NOTE: Also set `disable-supported` to `true` or `false` in `kep.yaml`.
 -->
 
-No. I don’t think it makes much sense to be able to provide users a meaningful
-ability to disable the feature flag at runtime, as this would be highly
-disruptive to workloads and difficult to implement. To turn this off, the
-kubelet would need to be restarted.
+No. The feature flag can be disabled while the `--fail-swap-on=false` flag is
+set, but this would result in undefined behaviour.
+
+To turn this off, the kubelet would need to be restarted. If a cluster admin
+wants to disable swap on the node without repartitioning the node, they could
+stop the kubelet, run `swapoff` on the node, and restart the kubelet with
+`--fail-swap-on=true`. The setting of the feature flag will be ignored in this
+case.
 
 ###### What happens if we reenable the feature if it was previously rolled back?
 
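Reviewer note on the `SwapBehavior` semantics touched by this patch: the sketch
below shows one way the described behaviours could map onto a combined
memory-plus-swap limit such as `memory_swap_limit_in_bytes`. It is a minimal
illustration, not kubelet code. The helper name `memorySwapLimit`, its
signature, and the string constants are hypothetical. The `-1` value for
unlimited swap follows the OCI runtime spec convention for the swap limit. How
`LimitedSwap` should interact with a container's own memory limit is not
settled by the text here, so the sketch simply returns the configured limit.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical constants mirroring the SwapBehavior values named in the KEP.
const (
	NoSwap        = "NoSwap"
	UnlimitedSwap = "UnlimitedSwap"
	LimitedSwap   = "LimitedSwap"
)

// memorySwapLimit is a hypothetical helper (an assumption for illustration).
// It returns a combined memory+swap limit in bytes for one container.
//   - memoryLimit: the container's memory limit in bytes.
//   - perWorkloadLimit: the node-wide LimitedSwap setting in bytes.
func memorySwapLimit(behavior string, memoryLimit, perWorkloadLimit int64) (int64, error) {
	switch behavior {
	case "", NoSwap:
		// Combined limit equal to the memory limit leaves no room for swap.
		return memoryLimit, nil
	case UnlimitedSwap:
		// -1 means "no limit" in the OCI runtime spec.
		return -1, nil
	case LimitedSwap:
		// The KEP requires the LimitedSwap configuration to be set as well.
		if perWorkloadLimit <= 0 {
			return 0, errors.New("LimitedSwap requires a positive per-workload limit")
		}
		// Assumption: use the configured limit as-is.
		return perWorkloadLimit, nil
	default:
		return 0, fmt.Errorf("unknown swap behavior %q", behavior)
	}
}

func main() {
	const gi = int64(1) << 30
	for _, b := range []string{NoSwap, UnlimitedSwap, LimitedSwap} {
		limit, err := memorySwapLimit(b, gi/2, gi)
		fmt.Println(b, limit, err)
	}
}
```

Treating an unset behaviour the same as `"NoSwap"` keeps the default at zero
swap for workloads, which matches the stated goal of not changing existing
behaviour unless an administrator opts in.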

From 277c51a3f38ff53ae127868ccb3ffd8e38012427 Mon Sep 17 00:00:00 2001
From: Elana Hashman <ehashman@redhat.com>
Date: Tue, 11 May 2021 15:45:17 -0700
Subject: [PATCH 8/9] Update based on reviewer feedback

---
 keps/sig-node/2400-node-swap/README.md | 69 +++++++++++++++-----------
 1 file changed, 40 insertions(+), 29 deletions(-)

diff --git a/keps/sig-node/2400-node-swap/README.md b/keps/sig-node/2400-node-swap/README.md
index 438743eb023..1724c30574d 100644
--- a/keps/sig-node/2400-node-swap/README.md
+++ b/keps/sig-node/2400-node-swap/README.md
@@ -21,7 +21,7 @@
   - [Enabling swap as an end user](#enabling-swap-as-an-end-user)
   - [API Changes](#api-changes)
     - [KubeConfig addition](#kubeconfig-addition)
-  - [CRI Changes](#cri-changes)
+    - [CRI Changes](#cri-changes)
   - [Test Plan](#test-plan)
   - [Graduation Criteria](#graduation-criteria)
     - [Alpha](#alpha)
@@ -40,6 +40,7 @@
 - [Drawbacks](#drawbacks)
 - [Alternatives](#alternatives)
   - [Just set <code>--fail-swap-on=false</code>](#just-set-)
+  - [Restrict swap usage at the cgroup level](#restrict-swap-usage-at-the-cgroup-level)
 - [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
 <!-- /toc -->
 
@@ -108,7 +109,7 @@ node.
 
 ### Scenarios
 
-1. Swap is enabled on a node's host system, but the CRI does not permit
+1. Swap is enabled on a node's host system, but the kubelet does not permit
    Kubernetes workloads to use swap. (This scenario is a prerequisite for the
    following use cases.)
 1. Swap is enabled at the node level. The CRI can be globally configured to
@@ -125,20 +126,23 @@ will be necessary to implement the third scenario.
 
 - On Linux systems, when swap is provisioned and available, Kubelet can start
   up with swap on.
-- Configuration is available for CRI to set swap utilization available to
+- Configuration is available for kubelet to set swap utilization available to
   Kubernetes workloads, defaulting to 0 swap.
-- Cluster administrators can enable and configure CRI swap utilization on a
+- Cluster administrators can enable and configure kubelet swap utilization on a
   per-node basis.
 - Use of swap memory with both cgroupsv1 and cgroupsv2 is supported.
 
 ### Non-Goals
 
+- Addressing non-Linux operating systems. Swap support will only be available
+  for Linux.
 - Provisioning swap. Swap must already be available on the system.
 - Setting [swappiness]. This can already be set on a system-wide level outside
   of Kubernetes.
 - Allocating swap on a per-workload basis with accounting (e.g. pod-level
   specification of swap). If desired, this should be designed and implemented
-  as part of a follow-up KEP. This KEP is a prerequisite for that work.
+  as part of a follow-up KEP. This KEP is a prerequisite for that work. Hence,
+  swap will be an overcommitted resource in the context of this KEP.
 - Supporting zram, zswap, or other memory types like SGX EPC. These could be
   addressed in a follow-up KEP, and are out of scope.
 
@@ -147,12 +151,12 @@ will be necessary to implement the third scenario.
 ## Proposal
 
 We propose that, when swap is provisioned and available on a node, cluster
-administrators can configure the Kubelet and CRI such that:
+administrators can configure the kubelet such that:
 
-- The kubelet can start with swap on.
-- The CRI is updated such that by default, workloads will use 0 swap.
-- The CRI will have configuration available such that swap utilization can be
-  configured for the entire node.
+- It can start with swap on.
+- It will direct the CRI to allocate Kubernetes workloads 0 swap by default.
+- It will have configuration options to configure swap utilization for the
+  entire node.
 
 This proposal enables scenarios 1 and 2 above, but not 3.
 
@@ -334,10 +338,8 @@ type KubeletConfiguration struct {
 type MemorySwapConfiguration struct {
 	// Configure swap memory available to container workloads. May be one of
 	// "", "NoSwap": workloads cannot use swap
-	// "WorkloadSpecifiedSwapLimit": workloads can use as much swap as their memory limit.
-	// "UnlimitedSwap": workloads can use unlimited swap, up to the system limit.
-	// "LimitedSwap": workloads can use a total of memory and swap up to this
-	// limit. When containers request more memory than this limit, they cannot use swap.
+	// "UnlimitedSwap": workloads can use unlimited swap, up to the allocatable limit.
+	// "LimitedSwap": workloads can use up to this limit of swap.
 	SwapBehavior string
 
 	LimitedSwap *LimitedSwapConfiguration
@@ -348,33 +350,25 @@ type LimitedSwapConfiguration struct {
 }
 ```
 
-We want to expose all possible swap settings based on the [Docker] and open
+We want to expose common swap configurations based on the [Docker] and open
 container specification for the `--memory-swap` flag. Thus, the
 `MemorySwapConfiguration.SwapBehavior` setting will have the following effects:
 
 * If `SwapBehavior` is not set or set to `"NoSwap"`, containers do not have
   access to swap. This value effectively prevents a container from using swap,
   even if it is enabled on a system.
-* If `SwapBehavior` is set to `"WorkloadSpecifiedSwapLimit"`, then for
-  containers with memory limit is set, the container can use as much swap as
-  its memory limit setting. For instance, if a container requests 300Mi memory
-  and `MemorySwapLimit` is not set, the container can use 600Mi total memory
-  and swap.
 * If `SwapBehavior` is set to `"UnlimitedSwap"`, the container is allowed to
   use unlimited swap, up to the maximum amount available on the host system.
 * If `SwapBehavior` is set to `"LimitedSwap"`, then the `LimitedSwap`
   configuration must also be set. `LimitedSwap.PerWorkloadMemorySwapLimit`
-  represents the system-wide maximum limit for combined memory and swap usage
-  of a container. For example, if the limit is set to `1Gi`:
-  * If the container's memory limit is 300Mi, it can use 1Gi combined memory
-    and swap (e.g. up to 700Mi swap).
-  * If the container's memory limit is 700Mi, it can use 1Gi combined memory
-    and swap (e.g. up to 300Mi swap).
-  * If the container's memory limit is 1Gi or greater, it cannot use swap.
+  represents the system-wide maximum limit for swap usage of a container. Note
+  that this limit applies to individual containers, and not at the pod-level,
+  in order to be set via the CRI rather than e.g. a [pod cgroup limit].
 
 [docker]: https://docs.docker.com/config/containers/resource_constraints/#--memory-swap-details
+[pod cgroup limit]: #restrict-swap-usage-at-the-cgroup-level
 
-### CRI Changes
+#### CRI Changes
 
 The CRI requires a corresponding change in order to allow the kubelet to set
 swap usage in container runtimes.  We will introduce a parameter
@@ -417,7 +411,6 @@ phase of graduation.
   to workloads. This will default to 0.
 - e2e test jobs are configured for Linux systems with swap enabled.
 
-
 #### Beta
 
 (Tentative.)
@@ -825,6 +818,24 @@ This inconsistency makes it difficult or impossible to use swap in production,
 particularly if a user wants to restrict workloads from using swap when using
 the CRI rather than dockershim.
 
+### Restrict swap usage at the cgroup level
+
+Setting a swap limit at the cgroup level would allow us to restrict the usage
+of swap on a pod-level, rather than container-level basis.
+
+For alpha, we are opting for the container-level basis to simplify the
+implementation (as the container runtimes already support configuration of swap
+with the `memory-swap-limit` parameter). This will also provide the necessary
+plumbing for container-level accounting of swap, if that is proposed in the
+future.
+
+In beta, we may want to revisit this.
+
+See the [Pod Resource Management design proposal] for more background on the
+cgroup limits the kubelet currently sets based on each QoS class.
+
+[Pod Resource Management design proposal]: https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node/pod-resource-management.md#pod-level-cgroups
+
 ## Infrastructure Needed (Optional)
 
 <!--

From b5c0dae5d05261efea320c96778e4eb77f657249 Mon Sep 17 00:00:00 2001
From: Elana Hashman <ehashman@redhat.com>
Date: Wed, 12 May 2021 17:09:00 -0700
Subject: [PATCH 9/9] Address next round of reviewer feedback

---
 keps/sig-node/2400-node-swap/README.md | 73 ++++++++++++++------------
 1 file changed, 39 insertions(+), 34 deletions(-)

diff --git a/keps/sig-node/2400-node-swap/README.md b/keps/sig-node/2400-node-swap/README.md
index 1724c30574d..14a86170e6b 100644
--- a/keps/sig-node/2400-node-swap/README.md
+++ b/keps/sig-node/2400-node-swap/README.md
@@ -112,10 +112,11 @@ node.
 1. Swap is enabled on a node's host system, but the kubelet does not permit
    Kubernetes workloads to use swap. (This scenario is a prerequisite for the
    following use cases.)
-1. Swap is enabled at the node level. The CRI can be globally configured to
-   permit user workloads scheduled on the node to use some quantity of swap.
-1. Swap is set on a per-workload basis. The CRI sets permitted swap utilization
-   on each individual workload.
+1. Swap is enabled at the node level. The kubelet can permit Kubernetes
+   workloads scheduled on the node to use some quantity of swap, depending on
+   the configuration.
+1. Swap is set on a per-workload basis. The kubelet sets swap limits for each
+   individual workload.
 
 This KEP will be limited in scope to the first two scenarios. The third can be
 addressed in a follow-up KEP. The enablement work that is in scope for this KEP
@@ -164,9 +165,9 @@ This proposal enables scenarios 1 and 2 above, but not 3.
 
 #### Improved Node Stability
 
-cgroupsv2 improved memory management algos, such as oomd, currently require
-swap. Hence, having a small amount of swap available on nodes could improve
-better resource pressure handling and recovery.
+cgroupsv2 improved memory management algorithms, such as oomd, strongly
+recommend the use of swap. Hence, having a small amount of swap available on
+nodes could enable better resource pressure handling and recovery.
 
 - https://man7.org/linux/man-pages/man8/systemd-oomd.service.8.html
 - https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#id1
@@ -253,7 +254,7 @@ This user story is addressed by scenario 2, and could benefit from 3.
 
 ### Notes/Constraints/Caveats (Optional)
 
-In changing the CRI, we must ensure that container runtime downstreams are able
+In updating the CRI, we must ensure that container runtime downstreams are able
 to support the new configurations.
 
 We considered adding parameters for both per-workload `memory-swap` and
@@ -301,26 +302,31 @@ We summarize the implementation plan as following:
 1. Add a feature gate `NodeSwapEnabled` to enable swap support.
 1. Leave the default value of kubelet flag `--fail-on-swap` to `true`, to avoid
    changing default behaviour.
-1. Introduce a new kubelet config parameter, `MemorySwapLimit`.
+1. Introduce a new kubelet config parameter, `MemorySwap`, which configures how
+   much swap Kubernetes workloads can use on the node.
 1. Introduce a new CRI parameter, `memory_swap_limit_in_bytes`.
-1. Integrate new kubelet config and pass values to CRI for container creation.
-1. Ensure container runtimes are updated so they can make use of the new CRI configuration.
+1. Ensure container runtimes are updated so they can make use of the new CRI
+   configuration.
+1. Based on the behaviour set in the kubelet config, the kubelet will instruct
+   the CRI on the amount of swap to allocate to each container. The container
+   runtime will then write the swap settings to the container level cgroup.
 
 ### Enabling swap as an end user
 
 Swap can be enabled as follows:
 
 1. Provision swap on the target worker nodes,
-1. Enable `NodeMemorySwap` flag on the kubelet,
+1. Enable the `NodeMemorySwap` feature flag on the kubelet,
 1. Set `--fail-on-swap` flag to `false`, and
-1. (Optional) Configure `MemorySwapLimit` in the KubeletConfig for tuning.
+1. (Optional) Allow Kubernetes workloads to use swap by setting
+   `MemorySwap.SwapBehavior=UnlimitedSwap` in the kubelet config.
 
 ### API Changes
 
 #### KubeConfig addition
 
-We will add an optional `MemorySwapLimit` value to the `KubeletConfig` struct
-in [pkg/kubelet/apis/config/types.go] for a compatible API change as follows:
+We will add an optional `MemorySwap` value to the `KubeletConfig` struct
+in [pkg/kubelet/apis/config/types.go] as follows:
 
 [pkg/kubelet/apis/config/types.go]: https://github.com/kubernetes/kubernetes/blob/6baad0a1d45435ff5844061aebab624c89d698f8/pkg/kubelet/apis/config/types.go#L81
 
@@ -339,14 +345,7 @@ type MemorySwapConfiguration struct {
 	// Configure swap memory available to container workloads. May be one of
 	// "", "NoSwap": workloads cannot use swap
 	// "UnlimitedSwap": workloads can use unlimited swap, up to the allocatable limit.
-	// "LimitedSwap": workloads can use up to this limit of swap.
 	SwapBehavior string
-
-	LimitedSwap *LimitedSwapConfiguration
-}
-
-type LimitedSwapConfiguration struct {
-	PerWorkloadMemorySwapLimit resource.Quantity
 }
 ```
 
@@ -359,14 +358,8 @@ container specification for the `--memory-swap` flag. Thus, the
   even if it is enabled on a system.
 * If `SwapBehavior` is set to `"UnlimitedSwap"`, the container is allowed to
   use unlimited swap, up to the maximum amount available on the host system.
-* If `SwapBehavior` is set to `"LimitedSwap"`, then the `LimitedSwap`
-  configuration must also be set. `LimitedSwap.PerWorkloadMemorySwapLimit`
-  represents the system-wide maximum limit for swap usage of a container. Note
-  that this limit applies to individual containers, and not at the pod-level,
-  in order to be set via the CRI rather than e.g. a [pod cgroup limit].
 
 [docker]: https://docs.docker.com/config/containers/resource_constraints/#--memory-swap-details
-[pod cgroup limit]: #restrict-swap-usage-at-the-cgroup-level
 
 #### CRI Changes
 
@@ -406,24 +399,36 @@ phase of graduation.
 
 #### Alpha
 
-- Kubelet can be started with swap enabled.
-- KubeletConfig allows CRI to be configured with a percentage of swap available
-  to workloads. This will default to 0.
+- Kubelet can be started with swap enabled and will support two configurations
+  for Kubernetes workloads: `NoSwap` and `UnlimitedSwap`.
+- Kubelet can configure CRI to allocate swap to Kubernetes workloads. By
+  default, workloads will not be allocated any swap.
 - e2e test jobs are configured for Linux systems with swap enabled.
 
 #### Beta
 
-(Tentative.)
+_(Tentative.)_
 
+- Add support for controlling swap consumption at the pod level [via cgroups].
+  - Handle usage of swap during container restart boundaries for writes to tmpfs
+    (which may require pod cgroup change beyond what container runtime will do at
+    container cgroup boundary).
+- Add the ability to set a system-reserved quantity of swap from what kubelet
+  detects on the host.
+- Consider introducing new configuration modes for swap, such as a node-wide
+  swap limit for workloads.
 - Determine a set of metrics for node QoS in order to evaluate the performance
   of nodes with and without swap enabled.
+  - Better understand relationship of swap with memory QoS in cgroup v2
+    (particularly `memory.high` usage).
 - Collect feedback from test user cases.
 - Improve coverage for appropriate scenarios in testgrid.
 
+[via cgroups]: #restrict-swap-usage-at-the-cgroup-level
+
 #### GA
 
-- Test a wide variety of scenarios that may be affected by swap support, such
-  as workloads using tmpfs storage.
+- Test a wide variety of scenarios that may be affected by swap support.
 - Remove feature flag.
 
 ### Upgrade / Downgrade Strategy