From a1941aad0f162fd0320806b6c8faee6ec93b2a73 Mon Sep 17 00:00:00 2001 From: Elana Hashman Date: Tue, 6 Apr 2021 15:21:48 -0700 Subject: [PATCH 1/9] Add draft for node swap KEP --- keps/sig-node/2400-node-swap/README.md | 632 +++++++++++++++++++++++++ keps/sig-node/2400-node-swap/kep.yaml | 38 ++ 2 files changed, 670 insertions(+) create mode 100644 keps/sig-node/2400-node-swap/README.md create mode 100644 keps/sig-node/2400-node-swap/kep.yaml diff --git a/keps/sig-node/2400-node-swap/README.md b/keps/sig-node/2400-node-swap/README.md new file mode 100644 index 00000000000..3386e3fa890 --- /dev/null +++ b/keps/sig-node/2400-node-swap/README.md @@ -0,0 +1,632 @@ +# KEP-2400: Node system swap support + + + + + + +- [Release Signoff Checklist](#release-signoff-checklist) +- [Summary](#summary) +- [Motivation](#motivation) + - [Scenarios](#scenarios) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [User Stories](#user-stories) + - [Improved Node Stability](#improved-node-stability) + - [Long-running applications that swap out startup memory](#long-running-applications-that-swap-out-startup-memory) + - [Memory Flexibility](#memory-flexibility) + - [Local development and systems with fast storage](#local-development-and-systems-with-fast-storage) + - [Low footprint systems](#low-footprint-systems) + - [Virtualization management overhead](#virtualization-management-overhead) + - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) + - [Risks and Mitigations](#risks-and-mitigations) +- [Design Details](#design-details) + - [Test Plan](#test-plan) + - [Graduation Criteria](#graduation-criteria) + - [Alpha](#alpha) + - [Beta](#beta) + - [GA](#ga) + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - [Version Skew Strategy](#version-skew-strategy) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - 
[Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) + - [Just set --fail-swap-on=false](#just-set-) +- [Infrastructure Needed (Optional)](#infrastructure-needed-optional) + + +## Release Signoff Checklist + + + +Items marked with (R) are required *prior to targeting to a milestone / release*. + +- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) +- [ ] (R) KEP approvers have approved the KEP status as `implementable` +- [ ] (R) Design details are appropriately documented +- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) +- [ ] (R) Graduation criteria is in place +- [ ] (R) Production readiness review completed +- [ ] (R) Production readiness review approved +- [ ] "Implementation History" section is up-to-date for milestone +- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] +- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes + + + +[kubernetes.io]: https://kubernetes.io/ +[kubernetes/enhancements]: https://git.k8s.io/enhancements +[kubernetes/kubernetes]: https://git.k8s.io/kubernetes +[kubernetes/website]: https://git.k8s.io/website + +## Summary + +Kubernetes currently does not support the use of [swap memory](https://en.wikipedia.org/wiki/Paging#Linux) on Linux, as it is difficult to provide guarantees and account for pod memory utilization when swap is involved. 
As part of Kubernetes’ earlier design, [swap support was considered out of scope](https://github.com/kubernetes/kubernetes/issues/7294). + +However, there are a [number of use cases](#user-stories) that would benefit from Kubernetes nodes supporting swap. Hence, this proposal aims to add swap support to nodes in a controlled, predictable manner so that Kubernetes users can perform testing and provide data to continue building cluster capabilities on top of swap. + +## Motivation + +There are two distinct types of user for swap, who may overlap: +- node administrators, who may want swap available for node-level performance tuning and stability/reducing noisy neighbour issues +- application developers, who have written applications that would benefit from using swap memory + +There are hence a number of possible ways that one could envision swap use on a node. + +### Scenarios + +1. Swap is enabled only at the system level. The CRI does not permit user workloads to use swap. (This scenario is a prerequisite for the following use cases.) +1. Swap is enabled at the node level. The CRI can be globally configured to permit user workloads scheduled on the node to use some quantity of swap. +1. Swap is set on a per-workload basis. The CRI sets permitted swap utilization on each individual workload. + +This KEP will be limited in scope to the first two scenarios. The third can be addressed in a follow-up KEP. The enablement work that is in scope for this KEP will be necessary to implement the third scenario. + + +### Goals + +- On Linux systems, when swap is provisioned and available, Kubelet can start up with swap on. +- Configuration is available for CRI to set swap utilization available to Kubernetes workloads, defaulting to 0 swap. +- Cluster administrators can enable and configure CRI swap utilization on a per-node basis. + +### Non-Goals + +- Provisioning swap. Swap must already be available on the system. 
+- Allocating swap on a per-workload basis with accounting (e.g. pod-level specification of swap). If desired, this should be designed and implemented as part of a follow-up KEP. This KEP is a prerequisite for that work. +- Supporting zram, zswap, or other memory types like SGX EPC. These could be addressed in a follow-up KEP, and are out of scope. + +## Proposal + +I propose that, when swap is provisioned and available on a node, we allow cluster administrators to configure the Kubelet and CRI such that: + +- The kubelet can start with swap on. +- The CRI is updated such that by default, workloads will use 0 swap. +- The CRI will have configuration available such that swap utilization can be configured for the entire node (e.g. as a percentage of pod memory requests). + +This proposal enables scenarios 1 and 2 above, but not 3. + +### User Stories + +#### Improved Node Stability + +The improved memory management algorithms available with cgroupsv2, such as oomd, currently require swap. Hence, having a small amount of swap available on nodes could enable better resource pressure handling and recovery. + +- https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#id1 +- https://chrisdown.name/2018/01/02/in-defence-of-swap.html +- https://media.ccc.de/v/ASG2018-175-oomd +- https://github.com/facebookincubator/oomd/blob/master/docs/production_setup.md#swap + +This user story is addressed by scenarios 1 and 2, and could benefit from 3. + +#### Long-running applications that swap out startup memory + +- Applications such as the Java and Node runtimes rely on swap for optimal performance https://github.com/kubernetes/kubernetes/issues/53533#issue-263475425 +- Initialization logic of applications can be safely swapped out without affecting long-running application resource usage https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-615967154 + +This user story is addressed by scenario 2, and could benefit from 3. 
+ +#### Memory Flexibility + +This user story addresses cases in which cost of additional memory is prohibitive, or elastic scaling is impossible (e.g. on-premise/bare metal deployments). + +- Occasional cron job with high memory usage and lack of swap support means cloud nodes must always be allocated for maximum possible memory utilization, leading to overprovisioning/high costs https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-354832960 +- Lack of swap support would require provisioning 3x the amount of memory as required with swap https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-617654228 +- On-premise deployment can’t horizontally scale available memory based on load https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-637715138 +- Scaling resources is technically feasible but cost-prohibitive, swap provides flexibility at lower cost https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-553713502 + +This user story is addressed by scenario 2, and could benefit from 3. + +#### Local development and systems with fast storage + +Local development or single-node clusters and systems with fast storage may benefit from using available swap (e.g. NVMe swap partitions, one-node clusters). + +- Single node, local Kubernetes deployment on laptop https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-361748518 +- Linux has optimizations for swap on SSD, allowing for performance boosts https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-589275277 + +This user story is addressed by scenarios 1 and 2, and could benefit from 3. + +#### Low footprint systems + +For example, edge devices with limited memory. 
+ +- Edge compute systems/devices with small memory footprints (\<2Gi) https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-751398086 +- Clusters with nodes \<4Gi memory https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-751404417 + +This user story is addressed by scenario 2, and could benefit from 3. + +#### Virtualization management overhead + +This would apply to virtualized Kubernetes workloads such as VMs launched by kubevirt. + +Every VM comes with a management related overhead which can sporadically be pretty significant (memory streaming, SRIOV attachment, gpu attachment, virtio-fs, …). Swap helps to not request much more memory to deal with short term worst-case scenarios. + +With virtualization, clusters are typically provisioned based on the workloads’ memory consumption, and any infrastructure container overhead is overcommitted. This overhead could be safely swapped out. + +- Required for live migration of VMs https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-754878431 + +This user story is addressed by scenario 2, and could benefit from 3. + +### Notes/Constraints/Caveats (Optional) + + + +### Risks and Mitigations + +Having swap available on a system reduces predictability. When swap is available to workloads, and is not accounted for on an individual workload-by-workload basis + +First, this risk is mitigated by preventing any workloads from using swap by default, even if it is enabled on a system. This will allow a cluster administrator to test swap utilization just at the system level without introducing unpredictability to workload resource utilization. + +Additionally, we mitigate this risk by quantifying system stability and then gathering test and production data to determine if system stability remains the same or is improved when swap is available to the system and/or workloads. 
+ +Since swap provisioning is out of scope of this proposal, this enhancement poses little risk to Kubernetes clusters that will not enable swap. + +## Design Details + +\[In progress\] + +Need to add specifics here for: + +- Changes to `--fail-on-swap` flag +- CRI config details +- Where changes will need to be made so that dockershim and the CRI are consistent with swap control + +### Test Plan + +For alpha: + +- Swap scenarios are enabled in test-infra for at least two Linux distributions. e2e suites will be run against them. +- Data should be gathered from a number of use cases to guide beta graduation and further development efforts. + +Once this data is available, additional test plans should be added for the next phase of graduation. + +### Graduation Criteria + +#### Alpha + +- Kubelet can be started with swap enabled. +- KubeletConfig allows CRI to be configured with a percentage of swap available to workloads. This will default to 0. +- e2e test jobs are configured for Linux systems with swap enabled. + + +#### Beta + +(Tentative.) + +- Determine a set of metrics for node QoS in order to evaluate the performance of nodes with and without swap enabled. +- Collect feedback from test user cases. +- Improve coverage for appropriate scenarios in testgrid. + +#### GA + +- Remove feature flag. + +### Upgrade / Downgrade Strategy + + + +No changes are required on upgrade to maintain previous behaviour. + +It is possible to downgrade a kubelet on a node that was using swap, but this would require disabling the use of swap and setting `swapoff` on the node. + +### Version Skew Strategy + + + +Feature flag will apply to kubelet only, so version skew strategy is N/A. + +## Production Readiness Review Questionnaire + + + +### Feature Enablement and Rollback + + + +###### How can this feature be enabled / disabled in a live cluster? 
+ + + +- [x] Feature gate (also fill in values in `kep.yaml`) + - Feature gate name: NodeSwapEnabled + - Components depending on the feature gate: Kubelet +- [ ] Other + - Describe the mechanism: + - Will enabling / disabling the feature require downtime of the control + plane? + - Will enabling / disabling the feature require downtime or reprovisioning + of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled). + +###### Does enabling the feature change any default behavior? + + + +No. If the feature flag is enabled, the user must still set `--fail-swap-on=false` to adjust the default behaviour. + +A node must have swap provisioned and available for this feature to work. If there is no swap available, but the feature flag is set to true, there will still be no change in existing behaviour. + +###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? + + + +No. I don’t think it makes much sense to be able to provide users a meaningful ability to disable the feature flag at runtime, as this would be highly disruptive to workloads and difficult to implement. To turn this off, the kubelet would need to be restarted. + +###### What happens if we reenable the feature if it was previously rolled back? + +N/A + +###### Are there any tests for feature enablement/disablement? + + + +N/A. This should be tested separately for scenarios with the flag enabled and disabled. + +### Rollout, Upgrade and Rollback Planning + + + +###### How can a rollout fail? Can it impact already running workloads? + + + +###### What specific metrics should inform a rollback? + + + +###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? + + + +###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? + + + +### Monitoring Requirements + + + +###### How can an operator determine if the feature is in use by workloads? 
+ + + +###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? + + + +- [ ] Metrics + - Metric name: + - [Optional] Aggregation method: + - Components exposing the metric: +- [ ] Other (treat as last resort) + - Details: + +###### What are the reasonable SLOs (Service Level Objectives) for the above SLIs? + + + +###### Are there any missing metrics that would be useful to have to improve observability of this feature? + + + +### Dependencies + + + +###### Does this feature depend on any specific services running in the cluster? + + + +No. + +### Scalability + + + +###### Will enabling / using this feature result in any new API calls? + + + +No. + +###### Will enabling / using this feature result in introducing new API types? + + + +No. + +###### Will enabling / using this feature result in any new calls to the cloud provider? + + + +No. + +###### Will enabling / using this feature result in increasing size or count of the existing API objects? + + + +The KubeletConfig API object may slightly increase in size due to new config fields. + +###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? + + + +It is possible for this feature to affect performance of some worker node-level SLIs/SLOs. We will need to monitor for differences, particularly during beta testing, when evaluating this feature for beta and graduation. + +###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? + + + +Yes. It will permit the utilization of swap memory (i.e. disk) on nodes. This is expected, as this enhancement is enabling cluster administrators to access this resource. + +### Troubleshooting + + + +###### How does this feature react if the API server and/or etcd is unavailable? + +###### What are other known failure modes? 
+ + + +###### What steps should be taken if SLOs are not being met to determine the problem? + +## Implementation History + +- **2015-04-24:** Discussed in [#7294](https://github.com/kubernetes/kubernetes/issues/7294). +- **2017-10-06:** Discussed in [#53533](https://github.com/kubernetes/kubernetes/issues/53533). +- **2021-01-05:** Initial design discussion document for swap support and use cases. +- **2021-04-05:** Alpha KEP drafted for initial node-level swap support and implementation (KEP-2400). + +## Drawbacks + + + +When swap is enabled, particularly for workloads, the kubelet’s resource accounting may become much less accurate. This may make cluster administration more difficult and less predictable. + +Currently, there exists an unsupported workaround, which is setting the kubelet flag `--fail-swap-on` to false. + +## Alternatives + +### Just set `--fail-swap-on=false` + +This is insufficient for most use cases because there is inconsistent control over how swap will be used by various container runtimes. Dockershim currently sets swap available for workloads to 0. The CRI does not restrict it at all. This inconsistency makes it difficult or impossible to use swap in production, particularly if a user wants to restrict workloads from using swap when using the CRI rather than dockershim. + +## Infrastructure Needed (Optional) + + + +We may need Linux VM images built with swap partitions for e2e testing in CI. 
diff --git a/keps/sig-node/2400-node-swap/kep.yaml b/keps/sig-node/2400-node-swap/kep.yaml new file mode 100644 index 00000000000..72e9ea2bebd --- /dev/null +++ b/keps/sig-node/2400-node-swap/kep.yaml @@ -0,0 +1,38 @@ +title: Node system swap support +kep-number: 2400 +authors: + - "@ehashman" +owning-sig: sig-node +participating-sigs: + - sig-node +status: provisional +creation-date: 2021-04-06 +reviewers: + - TBD +approvers: + - "@derekwaynecarr" + - "@dchen1107" +prr-approvers: + - "@johnbelamaric" + +# The target maturity stage in the current dev cycle for this KEP. +stage: alpha + +# The most recent milestone for which work toward delivery of this KEP has been +# done. This can be the current (upcoming) milestone, if it is being actively +# worked on. +latest-milestone: "v1.22" + +# The milestone at which this feature was, or is targeted to be, at each stage. +milestone: + alpha: "v1.22" + beta: "v1.23" + stable: "v1.24" + +# The following PRR answers are required at alpha release +# List the feature gate name and the components for which it must be enabled +feature-gates: + - name: NodeSwapEnabled + components: + - kubelet +disable-supported: false From 3577d17d4e4b06366a01faf4ed756c28c6bae30e Mon Sep 17 00:00:00 2001 From: Ike Ma Date: Wed, 28 Apr 2021 17:09:07 -0700 Subject: [PATCH 2/9] Implementation details from Ike Ma Co-Authored-By: Elana Hashman --- keps/sig-node/2400-node-swap/README.md | 109 +++++++++++++++++++++++-- keps/sig-node/2400-node-swap/kep.yaml | 1 + 2 files changed, 103 insertions(+), 7 deletions(-) diff --git a/keps/sig-node/2400-node-swap/README.md b/keps/sig-node/2400-node-swap/README.md index 3386e3fa890..a3835f50cba 100644 --- a/keps/sig-node/2400-node-swap/README.md +++ b/keps/sig-node/2400-node-swap/README.md @@ -220,13 +220,108 @@ Since swap provisioning is out of scope of this proposal, this enhancement poses ## Design Details -\[In progress\] - -Need to add specifics here for: - -- Changes to `--fail-on-swap` flag -- CRI config 
details -- Where changes will need to be made so that dockershim and the CRI are consistent with swap control +### TL;DR + +In a nutshell, the following implementation are planned for Memory Swap Support +in 1.22 GKE alpha + +1. Having a feature gate `SupportNodeMemorySwap` guarding against the memory + swap support feature +2. Keep the default value of kubelet flag `--fail-on-swap` to `true` in order + to minimize the blast radius +3. Introducing two new kubelet config `MemorySwapLimit` and `Swappiness` +4. Introducing two new CRI parameter `memory_swap_limit_in_bytes` and `memory_swappiness` +5. End to end wiring from kubelet config file to CRI + +### Expected User Behaviour + +For alpha, the feature gate `SupportNodeMemorySwap` is default to disabled, and +`--fail-on-swap` flag value is the same as 1.21. Therefore, from Kubernetes +user’s perspective, no behavior changes out of the box. + +For users that are ready to explore the Memory Swap feature in 1.22 Alpha, they +will need to complete the following steps + +1. provision swap enable `SupportNodeMemorySwap` flag AND +2. set `--fail-on-swap` flag to `false` + +Then, the user can start experimenting/fine tuning kubelet configuration +`MemorySwapLimit` and/or `Swappiness` and observe the changes. + +### New Kubelet Configuration + +We will be introducing two new parameters to `KubeletConfiguration struct` +defined in +[https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/apis/config/types.go](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/apis/config/types.go). +These two configurations, if set, will apply to every container of the Node +where kubelet is running. + +|Name|Description|Default Value|Feature Gate| +|--- |--- |--- |--- | +|MemorySwapLimit|This parameter sets total memory limit (memory + swap). 
This limits the total amount of memory this container is allowed to swap to disk.|-2, which enable disable swap|SupportNodeMemorySwap| +|MemorySwappiness|This configuration sets how aggressively the kernel will swap memory pages. By default, the host kernel can swap out a percentage of anonymous pages used by a container. Users can set value between 0 and 100, to tune this percentage.|Unset, which will use host value|SupportNodeMemorySwap| + +#### MemorySwapLimit details + +The MemorySwapLimit configuration is a kubelet configuration parameter that only takes effect on a +container that has a memory limit set, either explicitly from +[PodSpec](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#requests-and-limits) +or implicitly from [Resource +Quota](https://kubernetes.io/docs/concepts/policy/resource-quotas/). + +For a container with a memory limit set, the MemorySwapLimit setting will have the +following effects, [similar to +docker](https://docs.docker.com/config/containers/resource_constraints/#--memory-swap-details): + +* If MemorySwapLimit is set to a positive integer, + * If the memory limit of the container is greater than or equal to + MemorySwapLimit, then no swap is allowed: the container does not have + access to swap. + * If the memory limit of the container is less than MemorySwapLimit, then + MemorySwapLimit represents the total amount of memory and swap that can be + used. For example, for a container with memory limit set to 300m, and + `MemorySwapLimit` set to 1g, the container can use 300m of memory and 700m (1g + - 300m) swap. +* If MemorySwapLimit is set to 0, a container with a memory limit set + can use as much swap as its memory limit, if the host + has swap memory configured. 
For instance, if a container has a + memory limit of 300m and MemorySwapLimit is set to 0, the container can use 600m in + total of memory and swap. +* If MemorySwapLimit is explicitly set to -1, the container is allowed to use + unlimited swap, up to the amount available on the host system. +* If MemorySwapLimit is explicitly set to -2, the container does not have + access to swap. This value effectively prevents a container from using swap. + +In summary, for users experimenting with this feature: + +|MemorySwapLimit|container memory limit (explicit or implicit)|Expected Behavior|Comment| +|--- |--- |--- |--- | +|Any|not set|N/A|Same as docker| +|-2|N|no swap allowed, this is the default value|| +|-1|N|unlimited swap|Same as docker| +|0|N|container can use up to N swap (ie: 2N memory+swap)|Same as docker| +|X where X > 0|N where N < X|container can use up to X-N swap (ie: X memory+swap)|Same as docker| +|X where X > 0|N where N >= X|no swap allowed (ie: N memory only)|Same as docker| + +#### MemorySwappiness details + +* A value of 0 turns off anonymous page swapping. +* A value of 100 sets all anonymous pages as swappable. +* By default, if you do not set MemorySwappiness, the value is inherited from + the host machine. 
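The MemorySwapLimit rules listed above can be sketched as a small decision function. This is only an illustrative sketch under stated assumptions, not the kubelet's implementation: the `SwapAllowance` type and the `computeSwapAllowance` helper are hypothetical names, all values are plain byte counts, and treating negative values other than -1 as "no swap" is an assumption, since the proposal does not define them.

```go
package main

import "fmt"

// SwapAllowance is a hypothetical result type for this sketch.
type SwapAllowance struct {
	Applies   bool  // false when the container has no memory limit (the "N/A" row)
	Unlimited bool  // true when swap is bounded only by what the host provides
	Bytes     int64 // swap ceiling in bytes when Applies is true and Unlimited is false
}

// computeSwapAllowance is a hypothetical helper that applies the
// MemorySwapLimit rules from the proposal text. memoryLimit is the
// container memory limit in bytes (<= 0 meaning "not set") and
// memorySwapLimit is the node-wide MemorySwapLimit value.
func computeSwapAllowance(memoryLimit, memorySwapLimit int64) SwapAllowance {
	if memoryLimit <= 0 {
		// No memory limit: the setting has no effect on this container.
		return SwapAllowance{}
	}
	switch {
	case memorySwapLimit == -1:
		// Unlimited swap, up to the amount available on the host.
		return SwapAllowance{Applies: true, Unlimited: true}
	case memorySwapLimit == 0:
		// As much swap as the memory limit, so 2N memory+swap in total.
		return SwapAllowance{Applies: true, Bytes: memoryLimit}
	case memorySwapLimit > memoryLimit:
		// MemorySwapLimit is the memory+swap total; swap is the remainder.
		return SwapAllowance{Applies: true, Bytes: memorySwapLimit - memoryLimit}
	default:
		// -2 (the proposed default) and 0 < MemorySwapLimit <= memory limit:
		// no swap. Other negative values are assumed to mean the same.
		return SwapAllowance{Applies: true}
	}
}

func main() {
	const mb = 1000 * 1000
	// The 300m / 1g example from the text, using decimal units.
	fmt.Printf("%+v\n", computeSwapAllowance(300*mb, 1000*mb))
	// The proposed default of -2 disables swap for the container.
	fmt.Printf("%+v\n", computeSwapAllowance(300*mb, -2))
}
```

Returning a struct rather than a single number keeps the "not applicable" and "unlimited" cases distinct from a zero-byte allowance, which mirrors the separate rows of the summary table.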
+ +### CRI Changes + +We will be introducing the following two parameters +`memory_swap_limit_in_bytes` and `memory_swappiness` to `message +LinuxContainerResources` defined in +[https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/cri-api/pkg/apis/runtime/v1/api.proto#L563-L580](https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/cri-api/pkg/apis/runtime/v1/api.proto#L563-L580) + +|Name|Type|Description|Default Value|Feature Gate| +|--- |--- |--- |--- |--- | +|`memory_swap_limit_in_bytes`|int64|set/show limit of memory+swap usage|Default 0, which is unspecified.|SupportNodeMemorySwap| +|`memory_swappiness`|int64|set/show swappiness parameter|Default 0, which is unspecified.|SupportNodeMemorySwap| ### Test Plan diff --git a/keps/sig-node/2400-node-swap/kep.yaml b/keps/sig-node/2400-node-swap/kep.yaml index 72e9ea2bebd..cc8202d6130 100644 --- a/keps/sig-node/2400-node-swap/kep.yaml +++ b/keps/sig-node/2400-node-swap/kep.yaml @@ -2,6 +2,7 @@ title: Node system swap support kep-number: 2400 authors: - "@ehashman" + - "@ike-ma" owning-sig: sig-node participating-sigs: - sig-node From 27c30381c39633f49bfa20ddc5bc95c39ad2766f Mon Sep 17 00:00:00 2001 From: Elana Hashman Date: Thu, 29 Apr 2021 16:08:56 -0700 Subject: [PATCH 3/9] Rewrite implementation details, address feedback --- keps/sig-node/2400-node-swap/README.md | 212 +++++++++++-------------- keps/sig-node/2400-node-swap/kep.yaml | 1 + 2 files changed, 97 insertions(+), 116 deletions(-) diff --git a/keps/sig-node/2400-node-swap/README.md b/keps/sig-node/2400-node-swap/README.md index a3835f50cba..46617831b20 100644 --- a/keps/sig-node/2400-node-swap/README.md +++ b/keps/sig-node/2400-node-swap/README.md @@ -1,21 +1,5 @@ # KEP-2400: Node system swap support - - - - - [Release Signoff Checklist](#release-signoff-checklist) - [Summary](#summary) @@ -34,6 +18,10 @@ tags, and then generate with `hack/update-toc.sh`. 
- [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) - [Risks and Mitigations](#risks-and-mitigations) - [Design Details](#design-details) + - [Enabling swap as an end user](#enabling-swap-as-an-end-user) + - [API Changes](#api-changes) + - [KubeConfig addition](#kubeconfig-addition) + - [CRI Changes](#cri-changes) - [Test Plan](#test-plan) - [Graduation Criteria](#graduation-criteria) - [Alpha](#alpha) @@ -121,20 +109,24 @@ This KEP will be limited in scope to the first two scenarios. The third can be a - On Linux systems, when swap is provisioned and available, Kubelet can start up with swap on. - Configuration is available for CRI to set swap utilization available to Kubernetes workloads, defaulting to 0 swap. - Cluster administrators can enable and configure CRI swap utilization on a per-node basis. +- Use of swap memory with both cgroupsv1 and cgroupsv2 is supported. ### Non-Goals - Provisioning swap. Swap must already be available on the system. +- Setting [swappiness]. This can already be set on a system-wide level outside of Kubernetes. - Allocating swap on a per-workload basis with accounting (e.g. pod-level specification of swap). If desired, this should be designed and implemented as part of a follow-up KEP. This KEP is a prerequisite for that work. - Supporting zram, zswap, or other memory types like SGX EPC. These could be addressed in a follow-up KEP, and are out of scope. +[swappiness]: https://en.wikipedia.org/wiki/Memory_paging#Swappiness + ## Proposal -I propose that, when swap is provisioned and available on a node, we allow cluster administrators to configure the Kubelet and CRI such that: +We propose that, when swap is provisioned and available on a node, cluster administrators can configure the Kubelet and CRI such that: - The kubelet can start with swap on. - The CRI is updated such that by default, workloads will use 0 swap. 
-- The CRI will have configuration available such that swap utilization can be configured for the entire node (e.g. as a percentage of pod memory requests). +- The CRI will have configuration available such that swap utilization can be configured for the entire node. This proposal enables scenarios 1 and 2 above, but not 3. @@ -201,133 +193,121 @@ This user story is addressed by scenario 2, and could benefit from 3. ### Notes/Constraints/Caveats (Optional) - +In changing the CRI, we must ensure that container runtime downstreams are able to support the new configurations. -### Risks and Mitigations +We considered adding parameters for both per-workload `memory-swap` and `swappiness`. These are documented as part of the Open Containers [runtime specification] for Linux memory configuration. Since `memory-swap` is a per-workload parameter, and `swappiness` is optional and can be set globally, we are choosing to only expose `memory-swap` which will adjust swap available to workloads. -Having swap available on a system reduces predictability. When swap is available to workloads, and is not accounted for on an individual workload-by-workload basis +Since we are not currently setting `memory-swap` in the CRI, the default behaviour is to allocate the same amount of swap for a workload as memory requested. We will update the default to not permit the use of swap by setting `memory-swap` equal to `limit`. -First, this risk is mitigated by preventing any workloads from using swap by default, even if it is enabled on a system. This will allow a cluster administrator to test swap utilization just at the system level without introducing unpredictability to workload resource utilization. 
+[runtime specification]: https://github.com/opencontainers/runtime-spec/blob/1c3f411f041711bbeecf35ff7e93461ea6789220/config-linux.md#memory -Additionally, we mitigate this risk by quantifying system stability and then gathering test and production data to determine if system stability remains the same or is improved when swap is available to the system and/or workloads. +### Risks and Mitigations -Since swap provisioning is out of scope of this proposal, this enhancement poses little risk to Kubernetes clusters that will not enable swap. +Having swap available on a system reduces predictability. Swap's performance is worse than regular memory, sometimes by many orders of magnitude, which can cause unexpected performance regressions. Furthermore, swap changes a system's behaviour under memory pressure, and applications cannot directly control what portions of their memory usage are swapped out. Since enabling swap permits greater memory usage for workloads in Kubernetes that cannot be predictably accounted for, it also increases the risk of noisy neighbours and unexpected packing configurations, as the scheduler cannot account for swap memory usage. -## Design Details +This risk is mitigated by preventing any workloads from using swap by default, even if swap is enabled and available on a system. This will allow a cluster administrator to test swap utilization just at the system level without introducing unpredictability to workload resource utilization. -### TL;DR +Additionally, we will mitigate this risk by determining a set of metrics to quantify system stability and then gathering test and production data to determine if system stability changes when swap is available to the system and/or workloads in a number of different scenarios. 
-In a nutshell, the following implementation are planned for Memory Swap Support -in 1.22 GKE alpha +Since swap provisioning is out of scope of this proposal, this enhancement poses low risk to Kubernetes clusters that will not enable swap. -1. Having a feature gate `SupportNodeMemorySwap` guarding against the memory - swap support feature -2. Keep the default value of kubelet flag `--fail-on-swap` to `true` in order - to minimize the blast radius -3. Introducing two new kubelet config `MemorySwapLimit` and `Swappiness` -4. Introducing two new CRI parameter `memory_swap_limit_in_bytes` and `memory_swappiness` -5. End to end wiring from kubelet config file to CRI +## Design Details -### Expected User Behaviour +We summarize the implementation plan as follows: -For alpha, the feature gate `SupportNodeMemorySwap` is default to disabled, and -`--fail-on-swap` flag value is the same as 1.21. Therefore, from Kubernetes -user’s perspective, no behavior changes out of the box. +1. Add a feature gate `NodeSwapEnabled` to enable swap support. +1. Leave the default value of kubelet flag `--fail-on-swap` to `true`, to avoid changing default behaviour. +1. Introduce a new kubelet config parameter, `MemorySwapLimit`. +1. Introduce a new CRI parameter, `memory_swap_limit_in_bytes`. +1. Integrate new kubelet config and pass values to CRI for container creation. +1. Ensure container runtimes are updated so they can make use of the new CRI configuration. -For users that are ready to explore the Memory Swap feature in 1.22 Alpha, they -will need to complete the following steps +### Enabling swap as an end user -1. provision swap enable `SupportNodeMemorySwap` flag AND -2. set `--fail-on-swap` flag to `false` +Swap can be enabled as follows: -Then, the user can start experimenting/fine tuning kubelet configuration -`MemorySwapLimit` and/or `Swappiness` and observe the changes. +1. Provision swap on the target worker nodes, +1. Enable the `NodeSwapEnabled` feature gate on the kubelet, +1.
Set `--fail-on-swap` flag to `false`, and +1. (Optional) Configure `MemorySwapLimit` in the KubeletConfig for tuning. -### New Kubelet Configuration +### API Changes -We will be introducing two new parameters to `KubeletConfiguration struct` -defined in -[https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/apis/config/types.go](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/apis/config/types.go). -These two configurations, if set, will apply to every container of the Node -where kubelet is running. +#### KubeConfig addition -|Name|Description|Default Value|Feature Gate| -|--- |--- |--- |--- | -|MemorySwapLimit|This parameter sets total memory limit (memory + swap). This limits the total amount of memory this container is allowed to swap to disk.|-2, which enable disable swap|SupportNodeMemorySwap| -|MemorySwappiness|This configuration sets how aggressively the kernel will swap memory pages. By default, the host kernel can swap out a percentage of anonymous pages used by a container. 
Users can set value between 0 and 100, to tune this percentage.|Unset, which will use host value|SupportNodeMemorySwap| +We will add an optional `MemorySwapLimit` value to the `KubeletConfig` struct in [pkg/kubelet/apis/config/types.go] for a compatible API change as follows: -#### MemorySwapLimit details +[pkg/kubelet/apis/config/types.go]: https://github.com/kubernetes/kubernetes/blob/6baad0a1d45435ff5844061aebab624c89d698f8/pkg/kubelet/apis/config/types.go#L81 -MemorySwapLimit configuration is a kubelet flag that only takes effect on a -container that has a memory limit set, either explicitly from -[PodSpec]([https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#requests-and-limits](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#requests-and-limits) -) or implicitly from [Resource -Quota]([https://kubernetes.io/docs/concepts/policy/resource-quotas/](https://kubernetes.io/docs/concepts/policy/resource-quotas/) -). +```go +// KubeletConfiguration contains the configuration for the Kubelet +type KubeletConfiguration struct { + metav1.TypeMeta +... + // Configure swap memory available to container workloads. + // If not set, workloads cannot use swap. + // If set to 0, workloads can use as much swap as their memory limit. + // If set to -1, workloads can use unlimited swap, up to the system limit. + // If set to a positive integer, workloads can use a total of memory and swap up to this + // limit. When containers request more memory than this limit, they cannot use swap. 
+ // +featureGate=NodeSwapEnabled + // +optional + MemorySwapLimit *int64 +} +``` For container with memory limit set, MemorySwapLimit setting will have the -following effects, [similar to -docker](https://docs.docker.com/config/containers/resource_constraints/#--memory-swap-details) - -* If MemorySwapLimit is set to a positive integer, - * If the memory limit of the container is greater or equal to - MemorySwapLimit, then no swap is allowed, the container does not have - access to swap. - * If the memory limit of the container is less than MemorySwapLimit, then - MemorySwapLimit represents the total amount of memory and swap that can be - used. For example, for a container with memory limit set to 300m, and - `MemorySwapLimit` set to 1g, the container can use 300m of memory and 700m (1g - - 300m) swap. -* If MemorySwapLimit is set to 0, for containers with memory limit is set, the - container can use as much swap as the Memory limit setting, if the host - container has swap memory configured. For instance, if a container requests - memory="300m" and MemorySwapLimit is not set, the container can use 600m in - total of memory and swap. -* If MemorySwapLimit is explicitly set to -1, the container is allowed to use - unlimited swap, up to the amount available on the host system. -* If MemorySwapLimit is explicitly set to -2, the container does not have - access to swap. This value effectively prevents a container from using swap. 
- -In summary, for users experimenting with this feature - -|MemorySwapLimit|container memory limit (explicit or implicit)|Expected Behavior|Comment| -|--- |--- |--- |--- | -|Any|not set|N/A|Same as docker| -|-2|N|no swap allowed, this is the default value|| -|-1|N|unlimited swap|Same as docker| -|0|N|container can use up to N swap (ie: 2N memory+swap)|Same as docker| -|X where X > 0|N where N < X|container can use up to X-N swap (ie: 2N memory+swap)|Same as docker| -|X where X > 0|N where N >= X|no swap allowed (ie: N memory only)|Same as docker| - -#### MemorySwappiness details - -* A value of 0 turns off anonymous page swapping. -* A value of 100 sets all anonymous pages as swappable. -* By default, if you do not set MemorySwappiness, the value is inherited from - the host machine. +following effects, following the [Docker] and open container specification: + +* If `MemorySwapLimit` is not set, containers do not have access to swap. This + value effectively prevents a container from using swap, even if it is enabled + on a system. +* If `MemorySwapLimit` is set to 0, for containers with memory limit is set, the + container can use as much swap as its memory limit setting. For instance, if + a container requests 300Mi memory and `MemorySwapLimit` is not set, the + container can use 600Mi total memory and swap. +* If `MemorySwapLimit` is set to -1, the container is allowed to use + unlimited swap, up to the maximum amount available on the host system. +* If `MemorySwapLimit` is set to a positive integer, then for containers with a + memory limit set, that value represents the system-wide maximum limit for + combined memory and swap usage of a container. For example, if + `MemorySwapLimit` is set to 1073742000 (1Gi): + * If the container's memory limit is 300Mi, it can use 1Gi combined memory + and swap (e.g. up to 700Mi swap). + * If the container's memory limit is 700Mi, it can use 1Gi combined memory + and swap (e.g. up to 300Mi swap). 
+ * If the container's memory limit is 1Gi or greater, it cannot use swap. + +[docker]: https://docs.docker.com/config/containers/resource_constraints/#--memory-swap-details ### CRI Changes -We will be introducing the following two parameters -`memory_swap_limit_in_bytes` and `memory_swappiness` to `message -LinuxContainerResources` defined in -[https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/cri-api/pkg/apis/runtime/v1/api.proto#L563-L580](https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/cri-api/pkg/apis/runtime/v1/api.proto#L563-L580) - -|Name|Type|Description|Default Value|Feature Gate| -|--- |--- |--- |--- |--- | -|`memory_swap_limit_in_bytes`|int64|set/show limit of memory+swap usage|Default 0, which is unspecified.|SupportNodeMemorySwap| -|`memory_swappiness`|int64|set/show swappiness parameter|Default 0, which is unspecified.|SupportNodeMemorySwap| +The CRI requires a corresponding change in order to allow the kubelet to set swap usage in container runtimes. +We will introduce a parameter `memory_swap_limit_in_bytes` to the CRI API (found in [k8s.io/cri-api/pkg/apis/runtime/v1/api.proto]): + +[k8s.io/cri-api/pkg/apis/runtime/v1/api.proto]: https://github.com/kubernetes/kubernetes/blob/6baad0a1d45435ff5844061aebab624c89d698f8/staging/src/k8s.io/cri-api/pkg/apis/runtime/v1/api.proto#L563-L580 + +```go +// LinuxContainerResources specifies Linux specific configuration for +// resources. +message LinuxContainerResources { +... + // Memory limit in bytes. Default: 0 (not specified). + int64 memory_limit_in_bytes = 4; + // Memory + swap limit in bytes. Default: 0 (not specified). + int64 memory_swap_limit_in_bytes = 9; +... + // List of HugepageLimits to limit the HugeTLB usage of container per page size. Default: nil (not specified). + repeated HugepageLimit hugepage_limits = 8; +} +``` ### Test Plan For alpha: - Swap scenarios are enabled in test-infra for at least two Linux distributions. 
e2e suites will be run against them. + - Container runtimes must be bumped in CI to use the new CRI. - Data should be gathered from a number of use cases to guide beta graduation and further development efforts. Once this data is available, additional test plans should be added for the next phase of graduation. @@ -426,7 +406,7 @@ Pick one of these and delete the rest. - [x] Feature gate (also fill in values in `kep.yaml`) - Feature gate name: NodeSwapEnabled - - Components depending on the feature gate: Kubelet + - Components depending on the feature gate: API Server, Kubelet - [ ] Other - Describe the mechanism: - Will enabling / disabling the feature require downtime of the control diff --git a/keps/sig-node/2400-node-swap/kep.yaml b/keps/sig-node/2400-node-swap/kep.yaml index cc8202d6130..ad4b52d3082 100644 --- a/keps/sig-node/2400-node-swap/kep.yaml +++ b/keps/sig-node/2400-node-swap/kep.yaml @@ -35,5 +35,6 @@ milestone: feature-gates: - name: NodeSwapEnabled components: + - kube-apiserver - kubelet disable-supported: false From 15eabc886fa19ab5a60abb842369df1931770a54 Mon Sep 17 00:00:00 2001 From: Elana Hashman Date: Tue, 4 May 2021 16:22:38 -0700 Subject: [PATCH 4/9] Update KEP with API review feedback --- keps/sig-node/2400-node-swap/README.md | 61 ++++++++++++++++---------- 1 file changed, 38 insertions(+), 23 deletions(-) diff --git a/keps/sig-node/2400-node-swap/README.md b/keps/sig-node/2400-node-swap/README.md index 46617831b20..76a94b806b1 100644 --- a/keps/sig-node/2400-node-swap/README.md +++ b/keps/sig-node/2400-node-swap/README.md @@ -97,7 +97,7 @@ There are hence a number of possible ways that one could envision swap use on a ### Scenarios -1. Swap is enabled only at the system level. The CRI does not permit user workloads to use swap. (This scenario is a prerequisite for the following use cases.) +1. Swap is enabled on a node's host system, but the CRI does not permit Kubernetes workloads to use swap. 
(This scenario is a prerequisite for the following use cases.) 1. Swap is enabled at the node level. The CRI can be globally configured to permit user workloads scheduled on the node to use some quantity of swap. 1. Swap is set on a per-workload basis. The CRI sets permitted swap utilization on each individual workload. @@ -245,33 +245,46 @@ type KubeletConfiguration struct { metav1.TypeMeta ... // Configure swap memory available to container workloads. - // If not set, workloads cannot use swap. - // If set to 0, workloads can use as much swap as their memory limit. - // If set to -1, workloads can use unlimited swap, up to the system limit. - // If set to a positive integer, workloads can use a total of memory and swap up to this - // limit. When containers request more memory than this limit, they cannot use swap. // +featureGate=NodeSwapEnabled // +optional - MemorySwapLimit *int64 + MemorySwap MemorySwapConfiguration +} + +type MemorySwapConfiguration struct { + // Configure swap memory available to container workloads. May be one of + // "", "NoSwap": workloads cannot use swap + // "WorkloadSpecifiedSwapLimit": workloads can use as much swap as their memory limit. + // "UnlimitedSwap": workloads can use unlimited swap, up to the system limit. + // "LimitedSwap": workloads can use a total of memory and swap up to this + // limit. When containers request more memory than this limit, they cannot use swap. + SwapBehavior string + + LimitedSwap *LimitedSwapConfiguration +} + +type LimitedSwapConfiguration struct { + PerWorkloadMemorySwapLimit resource.Quantity } ``` -For container with memory limit set, MemorySwapLimit setting will have the -following effects, following the [Docker] and open container specification: - -* If `MemorySwapLimit` is not set, containers do not have access to swap. This - value effectively prevents a container from using swap, even if it is enabled - on a system. 
-* If `MemorySwapLimit` is set to 0, for containers with memory limit is set, the - container can use as much swap as its memory limit setting. For instance, if - a container requests 300Mi memory and `MemorySwapLimit` is not set, the - container can use 600Mi total memory and swap. -* If `MemorySwapLimit` is set to -1, the container is allowed to use - unlimited swap, up to the maximum amount available on the host system. -* If `MemorySwapLimit` is set to a positive integer, then for containers with a - memory limit set, that value represents the system-wide maximum limit for - combined memory and swap usage of a container. For example, if - `MemorySwapLimit` is set to 1073742000 (1Gi): +The `MemorySwapConfiguration.SwapBehavior` setting will have the following +effects, based on the [Docker] and open container specification for the +`--memory-swap` flag: + +* If `SwapBehavior` is not set or set to `"NoSwap"`, containers do not have + access to swap. This value effectively prevents a container from using swap, + even if it is enabled on a system. +* If `SwapBehavior` is set to `"WorkloadSpecifiedSwapLimit"`, then for + containers with a memory limit set, the container can use as much swap as + its memory limit setting. For instance, a container with a 300Mi memory + limit can use up to 300Mi of swap, for 600Mi combined memory + and swap. +* If `SwapBehavior` is set to `"UnlimitedSwap"`, the container is allowed to + use unlimited swap, up to the maximum amount available on the host system. +* If `SwapBehavior` is set to `"LimitedSwap"`, then the `LimitedSwap` + configuration must also be set. `LimitedSwap.PerWorkloadMemorySwapLimit` + represents the system-wide maximum limit for combined memory and swap usage + of a container. For example, if the limit is set to `1Gi`: * If the container's memory limit is 300Mi, it can use 1Gi combined memory and swap (e.g. up to 700Mi swap).
* If the container's memory limit is 700Mi, it can use 1Gi combined memory @@ -309,6 +322,7 @@ For alpha: - Swap scenarios are enabled in test-infra for at least two Linux distributions. e2e suites will be run against them. - Container runtimes must be bumped in CI to use the new CRI. - Data should be gathered from a number of use cases to guide beta graduation and further development efforts. + - Focus should be on supported user stories as listed above. Once this data is available, additional test plans should be added for the next phase of graduation. @@ -331,6 +345,7 @@ Once this data is available, additional test plans should be added for the next #### GA +- Test a wide variety of scenarios that may be affected by swap support, such as workloads using tmpfs storage. - Remove feature flag. ### Upgrade / Downgrade Strategy From 7b58bd2b3e74485f274ba5ca24d82601bbf7578d Mon Sep 17 00:00:00 2001 From: Elana Hashman Date: Wed, 5 May 2021 08:16:48 -0700 Subject: [PATCH 5/9] Update metadata and add PRR --- keps/prod-readiness/sig-node/2400.yaml | 3 +++ keps/sig-node/2400-node-swap/kep.yaml | 9 ++++++--- 2 files changed, 9 insertions(+), 3 deletions(-) create mode 100644 keps/prod-readiness/sig-node/2400.yaml diff --git a/keps/prod-readiness/sig-node/2400.yaml b/keps/prod-readiness/sig-node/2400.yaml new file mode 100644 index 00000000000..1eb33a410c5 --- /dev/null +++ b/keps/prod-readiness/sig-node/2400.yaml @@ -0,0 +1,3 @@ +kep-number: 2400 +alpha: + approver: "@deads2k" diff --git a/keps/sig-node/2400-node-swap/kep.yaml b/keps/sig-node/2400-node-swap/kep.yaml index ad4b52d3082..b514f563426 100644 --- a/keps/sig-node/2400-node-swap/kep.yaml +++ b/keps/sig-node/2400-node-swap/kep.yaml @@ -6,15 +6,18 @@ authors: owning-sig: sig-node participating-sigs: - sig-node -status: provisional +status: implementable creation-date: 2021-04-06 reviewers: - - TBD + - "@SergeyKanzhelev" + - "@anguslees" + - "@deads2k" + - "@sftim" approvers: - "@derekwaynecarr" - "@dchen1107" 
prr-approvers: - - "@johnbelamaric" + - "@deads2k" # The target maturity stage in the current dev cycle for this KEP. stage: alpha From 84adca77276a05ee5abab3aab48dceb525274381 Mon Sep 17 00:00:00 2001 From: Elana Hashman Date: Fri, 7 May 2021 14:22:25 -0700 Subject: [PATCH 6/9] Reflow to 80-char line width --- keps/sig-node/2400-node-swap/README.md | 237 ++++++++++++++++++------- 1 file changed, 173 insertions(+), 64 deletions(-) diff --git a/keps/sig-node/2400-node-swap/README.md b/keps/sig-node/2400-node-swap/README.md index 76a94b806b1..14355f61192 100644 --- a/keps/sig-node/2400-node-swap/README.md +++ b/keps/sig-node/2400-node-swap/README.md @@ -83,50 +83,76 @@ Items marked with (R) are required *prior to targeting to a milestone / release* ## Summary -Kubernetes currently does not support the use of [swap memory](https://en.wikipedia.org/wiki/Paging#Linux) on Linux, as it is difficult to provide guarantees and account for pod memory utilization when swap is involved. As part of Kubernetes’ earlier design, [swap support was considered out of scope](https://github.com/kubernetes/kubernetes/issues/7294). - -However, there are a [number of use cases](#user-stories) that would benefit from Kubernetes nodes supporting swap. Hence, this proposal aims to add swap support to nodes in a controlled, predictable manner so that Kubernetes users can perform testing and provide data to continue building cluster capabilities on top of swap. +Kubernetes currently does not support the use of [swap +memory](https://en.wikipedia.org/wiki/Paging#Linux) on Linux, as it is +difficult to provide guarantees and account for pod memory utilization when +swap is involved. As part of Kubernetes’ earlier design, [swap support was +considered out of scope](https://github.com/kubernetes/kubernetes/issues/7294). + +However, there are a [number of use cases](#user-stories) that would benefit +from Kubernetes nodes supporting swap. 
Hence, this proposal aims to add swap +support to nodes in a controlled, predictable manner so that Kubernetes users +can perform testing and provide data to continue building cluster capabilities +on top of swap. ## Motivation There are two distinct types of user for swap, who may overlap: -- node administrators, who may want swap available for node-level performance tuning and stability/reducing noisy neighbour issues -- application developers, who have written applications that would benefit from using swap memory +- node administrators, who may want swap available for node-level performance + tuning and stability/reducing noisy neighbour issues +- application developers, who have written applications that would benefit from + using swap memory -There are hence a number of possible ways that one could envision swap use on a node. +There are hence a number of possible ways that one could envision swap use on a +node. ### Scenarios -1. Swap is enabled on a node's host system, but the CRI does not permit Kubernetes workloads to use swap. (This scenario is a prerequisite for the following use cases.) -1. Swap is enabled at the node level. The CRI can be globally configured to permit user workloads scheduled on the node to use some quantity of swap. -1. Swap is set on a per-workload basis. The CRI sets permitted swap utilization on each individual workload. +1. Swap is enabled on a node's host system, but the CRI does not permit + Kubernetes workloads to use swap. (This scenario is a prerequisite for the + following use cases.) +1. Swap is enabled at the node level. The CRI can be globally configured to + permit user workloads scheduled on the node to use some quantity of swap. +1. Swap is set on a per-workload basis. The CRI sets permitted swap utilization + on each individual workload. -This KEP will be limited in scope to the first two scenarios. The third can be addressed in a follow-up KEP. 
The enablement work that is in scope for this KEP will be necessary to implement the third scenario. +This KEP will be limited in scope to the first two scenarios. The third can be +addressed in a follow-up KEP. The enablement work that is in scope for this KEP +will be necessary to implement the third scenario. ### Goals -- On Linux systems, when swap is provisioned and available, Kubelet can start up with swap on. -- Configuration is available for CRI to set swap utilization available to Kubernetes workloads, defaulting to 0 swap. -- Cluster administrators can enable and configure CRI swap utilization on a per-node basis. +- On Linux systems, when swap is provisioned and available, Kubelet can start + up with swap on. +- Configuration is available for CRI to set swap utilization available to + Kubernetes workloads, defaulting to 0 swap. +- Cluster administrators can enable and configure CRI swap utilization on a + per-node basis. - Use of swap memory with both cgroupsv1 and cgroupsv2 is supported. ### Non-Goals - Provisioning swap. Swap must already be available on the system. -- Setting [swappiness]. This can already be set on a system-wide level outside of Kubernetes. -- Allocating swap on a per-workload basis with accounting (e.g. pod-level specification of swap). If desired, this should be designed and implemented as part of a follow-up KEP. This KEP is a prerequisite for that work. -- Supporting zram, zswap, or other memory types like SGX EPC. These could be addressed in a follow-up KEP, and are out of scope. +- Setting [swappiness]. This can already be set on a system-wide level outside + of Kubernetes. +- Allocating swap on a per-workload basis with accounting (e.g. pod-level + specification of swap). If desired, this should be designed and implemented + as part of a follow-up KEP. This KEP is a prerequisite for that work. +- Supporting zram, zswap, or other memory types like SGX EPC. These could be + addressed in a follow-up KEP, and are out of scope. 
[swappiness]: https://en.wikipedia.org/wiki/Memory_paging#Swappiness ## Proposal -We propose that, when swap is provisioned and available on a node, cluster administrators can configure the Kubelet and CRI such that: +We propose that, when swap is provisioned and available on a node, cluster +administrators can configure the Kubelet and CRI such that: - The kubelet can start with swap on. - The CRI is updated such that by default, workloads will use 0 swap. -- The CRI will have configuration available such that swap utilization can be configured for the entire node. +- The CRI will have configuration available such that swap utilization can be + configured for the entire node. This proposal enables scenarios 1 and 2 above, but not 3. @@ -134,7 +160,9 @@ This proposal enables scenarios 1 and 2 above, but not 3. #### Improved Node Stability -cgroupsv2 improved memory management algos, such as oomd, currently require swap. Hence, having a small amount of swap available on nodes could improve better resource pressure handling and recovery. +cgroupsv2's improved memory management algorithms, such as oomd, currently +require swap. Hence, having a small amount of swap available on nodes could +improve resource pressure handling and recovery. - https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#id1 - https://chrisdown.name/2018/01/02/in-defence-of-swap.html @@ -145,28 +173,46 @@ This user story is addressed by scenario 1 and 2, and could benefit from 3.
#### Long-running applications that swap out startup memory -- Applications such as the Java and Node runtimes rely on swap for optimal performance https://github.com/kubernetes/kubernetes/issues/53533#issue-263475425 -- Initialization logic of applications can be safely swapped out without affecting long-running application resource usage https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-615967154 +- Applications such as the Java and Node runtimes rely on swap for optimal + performance + https://github.com/kubernetes/kubernetes/issues/53533#issue-263475425 +- Initialization logic of applications can be safely swapped out without + affecting long-running application resource usage + https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-615967154 This user story is addressed by scenario 2, and could benefit from 3. #### Memory Flexibility -This user story addresses cases in which cost of additional memory is prohibitive, or elastic scaling is impossible (e.g. on-premise/bare metal deployments). - -- Occasional cron job with high memory usage and lack of swap support means cloud nodes must always be allocated for maximum possible memory utilization, leading to overprovisioning/high costs https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-354832960 -- Lack of swap support would require provisioning 3x the amount of memory as required with swap https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-617654228 -- On-premise deployment can’t horizontally scale available memory based on load https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-637715138 -- Scaling resources is technically feasible but cost-prohibitive, swap provides flexibility at lower cost https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-553713502 +This user story addresses cases in which cost of additional memory is +prohibitive, or elastic scaling is impossible (e.g. on-premise/bare metal +deployments). 
+ +- Occasional cron job with high memory usage and lack of swap support means + cloud nodes must always be allocated for maximum possible memory utilization, + leading to overprovisioning/high costs + https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-354832960 +- Lack of swap support would require provisioning 3x the amount of memory as + required with swap + https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-617654228 +- On-premise deployment can’t horizontally scale available memory based on load + https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-637715138 +- Scaling resources is technically feasible but cost-prohibitive, swap provides + flexibility at lower cost + https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-553713502 This user story is addressed by scenario 2, and could benefit from 3. #### Local development and systems with fast storage -Local development or single-node clusters and systems with fast storage may benefit from using available swap (e.g. NVMe swap partitions, one-node clusters). +Local development or single-node clusters and systems with fast storage may +benefit from using available swap (e.g. NVMe swap partitions, one-node +clusters). -- Single node, local Kubernetes deployment on laptop https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-361748518 -- Linux has optimizations for swap on SSD, allowing for performance boosts https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-589275277 +- Single node, local Kubernetes deployment on laptop + https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-361748518 +- Linux has optimizations for swap on SSD, allowing for performance boosts + https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-589275277 This user story is addressed by scenarios 1 and 2, and could benefit from 3. @@ -174,49 +220,82 @@ This user story is addressed by scenarios 1 and 2, and could benefit from 3. 
For example, edge devices with limited memory. -- Edge compute systems/devices with small memory footprints (\<2Gi) https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-751398086 -- Clusters with nodes \<4Gi memory https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-751404417 +- Edge compute systems/devices with small memory footprints (\<2Gi) + https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-751398086 +- Clusters with nodes \<4Gi memory + https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-751404417 This user story is addressed by scenario 2, and could benefit from 3. #### Virtualization management overhead -This would apply to virtualized Kubernetes workloads such as VMs launched by kubevirt. +This would apply to virtualized Kubernetes workloads such as VMs launched by +kubevirt. -Every VM comes with a management related overhead which can sporadically be pretty significant (memory streaming, SRIOV attachment, gpu attachment, virtio-fs, …). Swap helps to not request much more memory to deal with short term worst-case scenarios. +Every VM comes with a management related overhead which can sporadically be +pretty significant (memory streaming, SRIOV attachment, gpu attachment, +virtio-fs, …). Swap helps to not request much more memory to deal with short +term worst-case scenarios. -With virtualization, clusters are typically provisioned based on the workloads’ memory consumption, and any infrastructure container overhead is overcommitted. This overhead could be safely swapped out. +With virtualization, clusters are typically provisioned based on the workloads’ +memory consumption, and any infrastructure container overhead is overcommitted. +This overhead could be safely swapped out. 
-- Required for live migration of VMs https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-754878431 +- Required for live migration of VMs + https://github.com/kubernetes/kubernetes/issues/53533#issuecomment-754878431 This user story is addressed by scenario 2, and could benefit from 3. ### Notes/Constraints/Caveats (Optional) -In changing the CRI, we must ensure that container runtime downstreams are able to support the new configurations. +In changing the CRI, we must ensure that container runtime downstreams are able +to support the new configurations. -We considered adding parameters for both per-workload `memory-swap` and `swappiness`. These are documented as part of the Open Containers [runtime specification] for Linux memory configuration. Since `memory-swap` is a per-workload parameter, and `swappiness` is optional and can be set globally, we are choosing to only expose `memory-swap` which will adjust swap available to workloads. +We considered adding parameters for both per-workload `memory-swap` and +`swappiness`. These are documented as part of the Open Containers [runtime +specification] for Linux memory configuration. Since `memory-swap` is a +per-workload parameter, and `swappiness` is optional and can be set globally, +we are choosing to only expose `memory-swap` which will adjust swap available +to workloads. -Since we are not currently setting `memory-swap` in the CRI, the default behaviour is to allocate the same amount of swap for a workload as memory requested. We will update the default to not permit the use of swap by setting `memory-swap` equal to `limit`. +Since we are not currently setting `memory-swap` in the CRI, the default +behaviour is to allocate the same amount of swap for a workload as memory +requested. We will update the default to not permit the use of swap by setting +`memory-swap` equal to `limit`. 
[runtime specification]: https://github.com/opencontainers/runtime-spec/blob/1c3f411f041711bbeecf35ff7e93461ea6789220/config-linux.md#memory ### Risks and Mitigations -Having swap available on a system reduces predictability. Swap's performance is worse than regular memory, sometimes by many orders of magnitude, which can cause unexpected performance regressions. Furthermore, swap changes a system's behaviour under memory pressure, and applications cannot directly control what portions of their memory usage are swapped out. Since enabling swap permits greater memory usage for workloads in Kubernetes that cannot be predictably accounted for, it also increases the risk of noisy neighbours and unexpected packing configurations, as the scheduler cannot account for swap memory usage. +Having swap available on a system reduces predictability. Swap's performance is +worse than regular memory, sometimes by many orders of magnitude, which can +cause unexpected performance regressions. Furthermore, swap changes a system's +behaviour under memory pressure, and applications cannot directly control what +portions of their memory usage are swapped out. Since enabling swap permits +greater memory usage for workloads in Kubernetes that cannot be predictably +accounted for, it also increases the risk of noisy neighbours and unexpected +packing configurations, as the scheduler cannot account for swap memory usage. -This risk is mitigated by preventing any workloads from using swap by default, even if swap is enabled and available on a system. This will allow a cluster administrator to test swap utilization just at the system level without introducing unpredictability to workload resource utilization. +This risk is mitigated by preventing any workloads from using swap by default, +even if swap is enabled and available on a system. 
This will allow a cluster +administrator to test swap utilization just at the system level without +introducing unpredictability to workload resource utilization. -Additionally, we will mitigate this risk by determining a set of metrics to quantify system stability and then gathering test and production data to determine if system stability changes when swap is available to the system and/or workloads in a number of different scenarios. +Additionally, we will mitigate this risk by determining a set of metrics to +quantify system stability and then gathering test and production data to +determine if system stability changes when swap is available to the system +and/or workloads in a number of different scenarios. -Since swap provisioning is out of scope of this proposal, this enhancement poses low risk to Kubernetes clusters that will not enable swap. +Since swap provisioning is out of scope of this proposal, this enhancement +poses low risk to Kubernetes clusters that will not enable swap. ## Design Details We summarize the implementation plan as follows: 1. Add a feature gate `NodeSwapEnabled` to enable swap support. -1. Leave the default value of kubelet flag `--fail-on-swap` to `true`, to avoid changing default behaviour. +1. Leave the default value of the kubelet flag `--fail-swap-on` at `true`, to avoid + changing default behaviour. 1. Introduce a new kubelet config parameter, `MemorySwapLimit`. 1. Introduce a new CRI parameter, `memory_swap_limit_in_bytes`. 1. Integrate new kubelet config and pass values to CRI for container creation.
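As a rough sketch of the last step in the plan above, the kubelet could translate its node-level swap configuration into the combined memory-plus-swap value sent over the CRI. The example is illustrative only: it assumes the `SwapBehavior` values (`NoSwap`, `UnlimitedSwap`, `LimitedSwap`) proposed in the API section of this KEP, treats the per-workload limit as a swap-only amount, and assumes that `-1` means unlimited, following the convention of the Open Containers runtime specification. `memorySwapLimitInBytes` is a hypothetical helper name, not an existing kubelet function.

```go
package main

import (
	"errors"
	"fmt"
)

// SwapBehavior mirrors the string values this KEP proposes for
// MemorySwapConfiguration.SwapBehavior.
type SwapBehavior string

const (
	NoSwap        SwapBehavior = "NoSwap"
	UnlimitedSwap SwapBehavior = "UnlimitedSwap"
	LimitedSwap   SwapBehavior = "LimitedSwap"
)

// unlimited is an assumed sentinel for "no limit", borrowed from the
// convention used by the Open Containers runtime specification.
const unlimited int64 = -1

// memorySwapLimitInBytes is a hypothetical helper. It computes the value a
// kubelet might place in the CRI field memory_swap_limit_in_bytes (memory +
// swap) from the container's memory limit and the node-level swap settings.
func memorySwapLimitInBytes(behavior SwapBehavior, memoryLimit, perWorkloadSwapLimit int64) (int64, error) {
	switch behavior {
	case "", NoSwap:
		// Combined limit equals the memory limit, so no swap is usable.
		return memoryLimit, nil
	case UnlimitedSwap:
		return unlimited, nil
	case LimitedSwap:
		if perWorkloadSwapLimit < 0 {
			return 0, errors.New("swap limit must be non-negative")
		}
		// Combined limit is the memory limit plus the allowed swap.
		return memoryLimit + perWorkloadSwapLimit, nil
	default:
		return 0, fmt.Errorf("unknown swap behavior %q", behavior)
	}
}

func main() {
	const mi = int64(1) << 20
	for _, b := range []SwapBehavior{NoSwap, UnlimitedSwap, LimitedSwap} {
		v, err := memorySwapLimitInBytes(b, 300*mi, 200*mi)
		fmt.Println(b, v, err)
	}
}
```

Note that with a memory limit of 0 (not specified), the `NoSwap` case yields 0, which the CRI field comment defines as not specified. A real implementation would need to decide how to treat containers that have no memory limit.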
@@ -235,7 +314,8 @@ Swap can be enabled as follows: #### KubeConfig addition -We will add an optional `MemorySwapLimit` value to the `KubeletConfig` struct in [pkg/kubelet/apis/config/types.go] for a compatible API change as follows: +We will add an optional `MemorySwapLimit` value to the `KubeletConfig` struct +in [pkg/kubelet/apis/config/types.go] for a compatible API change as follows: [pkg/kubelet/apis/config/types.go]: https://github.com/kubernetes/kubernetes/blob/6baad0a1d45435ff5844061aebab624c89d698f8/pkg/kubelet/apis/config/types.go#L81 @@ -295,8 +375,10 @@ effects, based on the [Docker] and open container specification for the ### CRI Changes -The CRI requires a corresponding change in order to allow the kubelet to set swap usage in container runtimes. -We will introduce a parameter `memory_swap_limit_in_bytes` to the CRI API (found in [k8s.io/cri-api/pkg/apis/runtime/v1/api.proto]): +The CRI requires a corresponding change in order to allow the kubelet to set +swap usage in container runtimes. We will introduce a parameter +`memory_swap_limit_in_bytes` to the CRI API (found in +[k8s.io/cri-api/pkg/apis/runtime/v1/api.proto]): [k8s.io/cri-api/pkg/apis/runtime/v1/api.proto]: https://github.com/kubernetes/kubernetes/blob/6baad0a1d45435ff5844061aebab624c89d698f8/staging/src/k8s.io/cri-api/pkg/apis/runtime/v1/api.proto#L563-L580 @@ -319,19 +401,23 @@ message LinuxContainerResources { For alpha: -- Swap scenarios are enabled in test-infra for at least two Linux distributions. e2e suites will be run against them. +- Swap scenarios are enabled in test-infra for at least two Linux + distributions. e2e suites will be run against them. - Container runtimes must be bumped in CI to use the new CRI. -- Data should be gathered from a number of use cases to guide beta graduation and further development efforts. +- Data should be gathered from a number of use cases to guide beta graduation + and further development efforts. 
- Focus should be on supported user stories as listed above. -Once this data is available, additional test plans should be added for the next phase of graduation. +Once this data is available, additional test plans should be added for the next +phase of graduation. ### Graduation Criteria #### Alpha - Kubelet can be started with swap enabled. -- KubeletConfig allows CRI to be configured with a percentage of swap available to workloads. This will default to 0. +- KubeletConfig allows CRI to be configured with a percentage of swap available + to workloads. This will default to 0. - e2e test jobs are configured for Linux systems with swap enabled. @@ -339,13 +425,15 @@ Once this data is available, additional test plans should be added for the next (Tentative.) -- Determine a set of metrics for node QoS in order to evaluate the performance of nodes with and without swap enabled. +- Determine a set of metrics for node QoS in order to evaluate the performance + of nodes with and without swap enabled. - Collect feedback from test user cases. - Improve coverage for appropriate scenarios in testgrid. #### GA -- Test a wide variety of scenarios that may be affected by swap support, such as workloads using tmpfs storage. +- Test a wide variety of scenarios that may be affected by swap support, such + as workloads using tmpfs storage. - Remove feature flag. ### Upgrade / Downgrade Strategy @@ -364,7 +452,8 @@ enhancement: No changes are required on upgrade to maintain previous behaviour. -It is possible to downgrade a kubelet on a node that was using swap, but this would require disabling the use of swap and setting `swapoff` on the node. +It is possible to downgrade a kubelet on a node that was using swap, but this +would require disabling the use of swap and setting `swapoff` on the node. ### Version Skew Strategy @@ -436,9 +525,12 @@ Any change of default behavior may be surprising to users or break existing automations, so be extremely careful here. --> -No. 
If the feature flag is enabled, the user must still set `--fail-swap-on=false` to adjust the default behaviour. +No. If the feature flag is enabled, the user must still set +`--fail-swap-on=false` to adjust the default behaviour. -A node must have swap provisioned and available for this feature to work. If there is no swap available, but the feature flag is set to true, there will still be no change in existing behaviour. +A node must have swap provisioned and available for this feature to work. If +there is no swap available, but the feature flag is set to true, there will +still be no change in existing behaviour. ###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? @@ -449,7 +541,10 @@ feature, can it break the existing applications?). NOTE: Also set `disable-supported` to `true` or `false` in `kep.yaml`. --> -No. I don’t think it makes much sense to be able to provide users a meaningful ability to disable the feature flag at runtime, as this would be highly disruptive to workloads and difficult to implement. To turn this off, the kubelet would need to be restarted. +No. I don’t think it makes much sense to be able to provide users a meaningful +ability to disable the feature flag at runtime, as this would be highly +disruptive to workloads and difficult to implement. To turn this off, the +kubelet would need to be restarted. ###### What happens if we reenable the feature if it was previously rolled back? @@ -464,7 +559,8 @@ with and without the feature, are necessary. At the very least, think about conversion tests if API types are being modified. --> -N/A. This should be tested separately for scenarios with the flag enabled and disabled. +N/A. This should be tested separately for scenarios with the flag enabled and +disabled. 
### Rollout, Upgrade and Rollback Planning @@ -630,7 +726,8 @@ Describe them, providing: - Estimated amount of new objects: (e.g., new Object X for every existing Pod) --> -The KubeletConfig API object may slightly increase in size due to new config fields. +The KubeletConfig API object may slightly increase in size due to new config +fields. ###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? @@ -643,7 +740,9 @@ Think about adding additional work or introducing new steps in between [existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos --> -It is possible for this feature to affect performance of some worker node-level SLIs/SLOs. We will need to monitor for differences, particularly during beta testing, when evaluating this feature for beta and graduation. +It is possible for this feature to affect performance of some worker node-level +SLIs/SLOs. We will need to monitor for differences, particularly during beta +testing, when evaluating this feature for beta and graduation. ###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? @@ -657,7 +756,9 @@ Think through this both in small and large cases, again with respect to the [supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md --> -Yes. It will permit the utilization of swap memory (i.e. disk) on nodes. This is expected, as this enhancement is enabling cluster administrators to access this resource. +Yes. It will permit the utilization of swap memory (i.e. disk) on nodes. This +is expected, as this enhancement is enabling cluster administrators to access +this resource. ### Troubleshooting @@ -701,15 +802,23 @@ For each of them, fill in the following information by copying the below templat Why should this KEP _not_ be implemented?
--> -When swap is enabled, particularly for workloads, the kubelet’s resource accounting may become much less accurate. This may make cluster administration more difficult and less predictable. +When swap is enabled, particularly for workloads, the kubelet’s resource +accounting may become much less accurate. This may make cluster administration +more difficult and less predictable. -Currently, there exists an unsupported workaround, which is setting the kubelet flag `--fail-swap-on` to false. +Currently, there exists an unsupported workaround, which is setting the kubelet +flag `--fail-swap-on` to false. ## Alternatives ### Just set `--fail-swap-on=false` -This is insufficient for most use cases because there is inconsistent control over how swap will be used by various container runtimes. Dockershim currently sets swap available for workloads to 0. The CRI does not restrict it at all. This inconsistency makes it difficult or impossible to use swap in production, particularly if a user wants to restrict workloads from using swap when using the CRI rather than dockershim. +This is insufficient for most use cases because there is inconsistent control +over how swap will be used by various container runtimes. Dockershim currently +sets swap available for workloads to 0. The CRI does not restrict it at all. +This inconsistency makes it difficult or impossible to use swap in production, +particularly if a user wants to restrict workloads from using swap when using +the CRI rather than dockershim. 
## Infrastructure Needed (Optional) From 64e639aabb512ee6feffc2f481ba394b76bad7de Mon Sep 17 00:00:00 2001 From: Elana Hashman Date: Fri, 7 May 2021 14:55:11 -0700 Subject: [PATCH 7/9] Address PRR and other review comments --- keps/sig-node/2400-node-swap/README.md | 43 ++++++++++++++------------ 1 file changed, 24 insertions(+), 19 deletions(-) diff --git a/keps/sig-node/2400-node-swap/README.md b/keps/sig-node/2400-node-swap/README.md index 14355f61192..438743eb023 100644 --- a/keps/sig-node/2400-node-swap/README.md +++ b/keps/sig-node/2400-node-swap/README.md @@ -164,6 +164,7 @@ cgroupsv2 improved memory management algos, such as oomd, currently require swap. Hence, having a small amount of swap available on nodes could improve resource pressure handling and recovery. +- https://man7.org/linux/man-pages/man8/systemd-oomd.service.8.html - https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#id1 - https://chrisdown.name/2018/01/02/in-defence-of-swap.html - https://media.ccc.de/v/ASG2018-175-oomd @@ -258,10 +259,10 @@ per-workload parameter, and `swappiness` is optional and can be set globally, we are choosing to only expose `memory-swap` which will adjust swap available to workloads. -Since we are not currently setting `memory-swap` in the CRI, the default -behaviour is to allocate the same amount of swap for a workload as memory -requested. We will update the default to not permit the use of swap by setting -`memory-swap` equal to `limit`. +Since we are not currently setting `memory-swap` in the CRI, the current +default behaviour when `--fail-swap-on=false` is set is to allocate the same +amount of swap for a workload as memory requested. We will update the default +to not permit the use of swap by setting `memory-swap` equal to `limit`.
[runtime specification]: https://github.com/opencontainers/runtime-spec/blob/1c3f411f041711bbeecf35ff7e93461ea6789220/config-linux.md#memory @@ -347,9 +348,9 @@ type LimitedSwapConfiguration struct { } ``` -The `MemorySwapConfiguration.SwapBehavior` setting will have the following -effects, based on the [Docker] and open container specification for the -`--memory-swap` flag: +We want to expose all possible swap settings based on the [Docker] and open +container specification for the `--memory-swap` flag. Thus, the +`MemorySwapConfiguration.SwapBehavior` setting will have the following effects: * If `SwapBehavior` is not set or set to `"NoSwap"`, containers do not have access to swap. This value effectively prevents a container from using swap, @@ -361,7 +362,7 @@ effects, based on the [Docker] and open container specification for the and swap. * If `SwapBehavior` is set to `"UnlimitedSwap"`, the container is allowed to use unlimited swap, up to the maximum amount available on the host system. -* If `SwapBehavior` is set to a `"LimitedSwap"`, then the `LimitedSwap` +* If `SwapBehavior` is set to `"LimitedSwap"`, then the `LimitedSwap` configuration must also be set. `LimitedSwap.PerWorkloadMemorySwapLimit` represents the system-wide maximum limit for combined memory and swap usage of a container. For example, if the limit is set to `1Gi`: @@ -387,13 +388,9 @@ swap usage in container runtimes. We will introduce a parameter // resources. message LinuxContainerResources { ... - // Memory limit in bytes. Default: 0 (not specified). - int64 memory_limit_in_bytes = 4; // Memory + swap limit in bytes. Default: 0 (not specified). int64 memory_swap_limit_in_bytes = 9; ... - // List of HugepageLimits to limit the HugeTLB usage of container per page size. Default: nil (not specified). - repeated HugepageLimit hugepage_limits = 8; } ``` @@ -511,12 +508,16 @@ Pick one of these and delete the rest. 
- [x] Feature gate (also fill in values in `kep.yaml`) - Feature gate name: NodeSwapEnabled - Components depending on the feature gate: API Server, Kubelet -- [ ] Other - - Describe the mechanism: +- [x] Other + - Describe the mechanism: `--fail-swap-on=false` flag for kubelet must also + be set at kubelet start - Will enabling / disabling the feature require downtime of the control - plane? + plane? Yes. Flag must be set on kubelet start. To disable, kubelet must be + restarted. Hence, there would be brief control component downtime on a + given node. - Will enabling / disabling the feature require downtime or reprovisioning of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled). + Yes. See above; disabling would require brief node downtime. ###### Does enabling the feature change any default behavior? @@ -541,10 +542,14 @@ feature, can it break the existing applications?). NOTE: Also set `disable-supported` to `true` or `false` in `kep.yaml`. --> -No. I don’t think it makes much sense to be able to provide users a meaningful -ability to disable the feature flag at runtime, as this would be highly -disruptive to workloads and difficult to implement. To turn this off, the -kubelet would need to be restarted. +No. The feature flag can be disabled while the `--fail-swap-on=false` flag is +set, but this would result in undefined behaviour. + +To turn this off, the kubelet would need to be restarted. If a cluster admin +wants to disable swap on the node without repartitioning the node, they could +stop the kubelet, set `swapoff` on the node, and restart the kubelet with +`--fail-swap-on=true`. The setting of the feature flag will be ignored in this +case. ###### What happens if we reenable the feature if it was previously rolled back? 
From 277c51a3f38ff53ae127868ccb3ffd8e38012427 Mon Sep 17 00:00:00 2001 From: Elana Hashman Date: Tue, 11 May 2021 15:45:17 -0700 Subject: [PATCH 8/9] Update based on reviewer feedback --- keps/sig-node/2400-node-swap/README.md | 69 +++++++++++++++----------- 1 file changed, 40 insertions(+), 29 deletions(-) diff --git a/keps/sig-node/2400-node-swap/README.md b/keps/sig-node/2400-node-swap/README.md index 438743eb023..1724c30574d 100644 --- a/keps/sig-node/2400-node-swap/README.md +++ b/keps/sig-node/2400-node-swap/README.md @@ -21,7 +21,7 @@ - [Enabling swap as an end user](#enabling-swap-as-an-end-user) - [API Changes](#api-changes) - [KubeConfig addition](#kubeconfig-addition) - - [CRI Changes](#cri-changes) + - [CRI Changes](#cri-changes) - [Test Plan](#test-plan) - [Graduation Criteria](#graduation-criteria) - [Alpha](#alpha) @@ -40,6 +40,7 @@ - [Drawbacks](#drawbacks) - [Alternatives](#alternatives) - [Just set --fail-swap-on=false](#just-set-) + - [Restrict swap usage at the cgroup level](#restrict-swap-usage-at-the-cgroup-level) - [Infrastructure Needed (Optional)](#infrastructure-needed-optional) @@ -108,7 +109,7 @@ node. ### Scenarios -1. Swap is enabled on a node's host system, but the CRI does not permit +1. Swap is enabled on a node's host system, but the kubelet does not permit Kubernetes workloads to use swap. (This scenario is a prerequisite for the following use cases.) 1. Swap is enabled at the node level. The CRI can be globally configured to @@ -125,20 +126,23 @@ will be necessary to implement the third scenario. - On Linux systems, when swap is provisioned and available, Kubelet can start up with swap on. -- Configuration is available for CRI to set swap utilization available to +- Configuration is available for kubelet to set swap utilization available to Kubernetes workloads, defaulting to 0 swap. 
-- Cluster administrators can enable and configure CRI swap utilization on a +- Cluster administrators can enable and configure kubelet swap utilization on a per-node basis. - Use of swap memory with both cgroupsv1 and cgroupsv2 is supported. ### Non-Goals +- Addressing non-Linux operating systems. Swap support will only be available + for Linux. - Provisioning swap. Swap must already be available on the system. - Setting [swappiness]. This can already be set on a system-wide level outside of Kubernetes. - Allocating swap on a per-workload basis with accounting (e.g. pod-level specification of swap). If desired, this should be designed and implemented - as part of a follow-up KEP. This KEP is a prerequisite for that work. + as part of a follow-up KEP. This KEP is a prerequisite for that work. Hence, + swap will be an overcommitted resource in the context of this KEP. - Supporting zram, zswap, or other memory types like SGX EPC. These could be addressed in a follow-up KEP, and are out of scope. @@ -147,12 +151,12 @@ will be necessary to implement the third scenario. ## Proposal We propose that, when swap is provisioned and available on a node, cluster -administrators can configure the Kubelet and CRI such that: +administrators can configure the kubelet such that: -- The kubelet can start with swap on. -- The CRI is updated such that by default, workloads will use 0 swap. -- The CRI will have configuration available such that swap utilization can be - configured for the entire node. +- It can start with swap on. +- It will direct the CRI to allocate Kubernetes workloads 0 swap by default. +- It will have configuration options to configure swap utilization for the + entire node. This proposal enables scenarios 1 and 2 above, but not 3. @@ -334,10 +338,8 @@ type KubeletConfiguration struct { type MemorySwapConfiguration struct { // Configure swap memory available to container workloads. 
May be one of // "", "NoSwap": workloads cannot use swap - // "WorkloadSpecifiedSwapLimit": workloads can use as much swap as their memory limit. - // "UnlimitedSwap": workloads can use unlimited swap, up to the system limit. - // "LimitedSwap": workloads can use a total of memory and swap up to this - // limit. When containers request more memory than this limit, they cannot use swap. + // "UnlimitedSwap": workloads can use unlimited swap, up to the allocatable limit. + // "LimitedSwap": workloads can use up to this limit of swap. SwapBehavior string LimitedSwap *LimitedSwapConfiguration @@ -348,33 +350,25 @@ type LimitedSwapConfiguration struct { } ``` -We want to expose all possible swap settings based on the [Docker] and open +We want to expose common swap configurations based on the [Docker] and open container specification for the `--memory-swap` flag. Thus, the `MemorySwapConfiguration.SwapBehavior` setting will have the following effects: * If `SwapBehavior` is not set or set to `"NoSwap"`, containers do not have access to swap. This value effectively prevents a container from using swap, even if it is enabled on a system. -* If `SwapBehavior` is set to `"WorkloadSpecifiedSwapLimit"`, then for - containers with memory limit is set, the container can use as much swap as - its memory limit setting. For instance, if a container requests 300Mi memory - and `MemorySwapLimit` is not set, the container can use 600Mi total memory - and swap. * If `SwapBehavior` is set to `"UnlimitedSwap"`, the container is allowed to use unlimited swap, up to the maximum amount available on the host system. * If `SwapBehavior` is set to `"LimitedSwap"`, then the `LimitedSwap` configuration must also be set. `LimitedSwap.PerWorkloadMemorySwapLimit` - represents the system-wide maximum limit for combined memory and swap usage - of a container. For example, if the limit is set to `1Gi`: - * If the container's memory limit is 300Mi, it can use 1Gi combined memory - and swap (e.g. 
up to 700Mi swap). - * If the container's memory limit is 700Mi, it can use 1Gi combined memory - and swap (e.g. up to 300Mi swap). - * If the container's memory limit is 1Gi or greater, it cannot use swap. + represents the system-wide maximum limit for swap usage of a container. Note + that this limit applies to individual containers, and not at the pod-level, + in order to be set via the CRI rather than e.g. a [pod cgroup limit]. [docker]: https://docs.docker.com/config/containers/resource_constraints/#--memory-swap-details +[pod cgroup limit]: #restrict-swap-usage-at-the-cgroup-level -### CRI Changes +#### CRI Changes The CRI requires a corresponding change in order to allow the kubelet to set swap usage in container runtimes. We will introduce a parameter @@ -417,7 +411,6 @@ phase of graduation. to workloads. This will default to 0. - e2e test jobs are configured for Linux systems with swap enabled. - #### Beta (Tentative.) @@ -825,6 +818,24 @@ This inconsistency makes it difficult or impossible to use swap in production, particularly if a user wants to restrict workloads from using swap when using the CRI rather than dockershim. +### Restrict swap usage at the cgroup level + +Setting a swap limit at the cgroup level would allow us to restrict the usage +of swap on a pod-level, rather than container-level basis. + +For alpha, we are opting for the container-level basis to simplify the +implementation (as the container runtimes already support configuration of swap +with the `memory-swap-limit` parameter). This will also provide the necessary +plumbing for container-level accounting of swap, if that is proposed in the +future. + +In beta, we may want to revisit this. + +See the [Pod Resource Management design proposal] for more background on the +cgroup limits the kubelet currently sets based on each QoS class. 
+ +[Pod Resource Management design proposal]: https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node/pod-resource-management.md#pod-level-cgroups + ## Infrastructure Needed (Optional)