[KEP-2400] Node swap updates, GA criterias and clarifications #4701

Open
wants to merge 14 commits into
base: master

Contributor

@iholder101 iholder101 commented Jun 6, 2024

  • One-line PR description:
    Add updates, GA criteria, and clarifications
  • Other comments:

This PR updates the KEP in the following ways:

Emphasize that this KEP is about basic swap enablement
The original KEP indicated that pod-level swap APIs are out of scope:

- Allocating swap on a per-workload basis with accounting (e.g. pod-level
specification of swap). If desired, this should be designed and implemented
as part of a follow-up KEP. This KEP is a prerequisite for that work. Hence,
swap will be an overcommitted resource in the context of this KEP.

This KEP will be limited in scope to the first two scenarios. The third can be
addressed in a follow-up KEP. The enablement work that is in scope for this KEP
will be necessary to implement the third scenario.

However, the lack of APIs and the implicit nature of the current implementation sometimes prompt suggestions to extend the API under this KEP.

This KEP focuses on basic swap enablement. Follow-up KEPs on several topics (e.g. customization, zram/zswap support, and more) will be introduced in the near future, so that each extension can be designed and implemented in a focused way.

To ensure we're on the same page, this topic was recently raised in a sig-node meeting. In this meeting there was a very broad consensus that this approach makes sense, especially since the NodeSwap feature is important to many different parties which want it to "just work".

This PR updates the KEP to emphasize this approach.

GA criteria
The PR adds GA criteria, alongside the intent to GA in version 1.32.

Make sure PRR is ready

Updates
Since the last KEP updates, many improvements were made and many concerns were addressed. For example:

  • Memory-backed volumes
  • Added metrics
  • Kubelet Configuration examples
  • more

This PR updates the KEP to reflect these updates.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. labels Jun 6, 2024
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 6, 2024
@iholder101
Contributor Author

@iholder101 iholder101 force-pushed the kep2400/post_beta2 branch 2 times, most recently from 16c4878 to b3f9708 Compare June 9, 2024 09:17
@deads2k
Contributor

deads2k commented Jun 10, 2024

Please update https://github.com/kubernetes/enhancements/blob/master/keps/prod-readiness/sig-node/2400.yaml and update missing bits of the PRR questionnaire.

@iholder101 iholder101 force-pushed the kep2400/post_beta2 branch from b3f9708 to 443db7f Compare June 18, 2024 10:47
@iholder101
Contributor Author

Thanks @deads2k!

Please update master/keps/prod-readiness/sig-node/2400.yaml

I see you're the assigned approver for alpha/beta.
Is it OK to also assign you as the approver for GA?

and update missing bits of the PRR questionnaire.

Done! PTAL :)

@iholder101 iholder101 force-pushed the kep2400/post_beta2 branch 2 times, most recently from f4af444 to 4124649 Compare June 18, 2024 10:55
@sftim
Contributor

sftim commented Jun 24, 2024

/retitle [KEP-2400] Node swap ppdates, GA criterias and clarifications

@k8s-ci-robot k8s-ci-robot changed the title [KEP-2400] Updates, GA criterias and clarifications [KEP-2400] Node swap ppdates, GA criterias and clarifications Jun 24, 2024
@sftim
Contributor

sftim commented Jun 24, 2024

D'oh

/retitle [KEP-2400] Node swap updates, GA criterias and clarifications

@k8s-ci-robot k8s-ci-robot changed the title [KEP-2400] Node swap ppdates, GA criterias and clarifications [KEP-2400] Node swap updates, GA criterias and clarifications Jun 24, 2024
Contributor

@sftim sftim left a comment

Thanks for the PR.

Here's a mix of feedback; I hope it is all useful.

@iholder101 iholder101 force-pushed the kep2400/post_beta2 branch 2 times, most recently from a85dd1c to 8e73c9e Compare July 10, 2024 14:04
@kannon92
Contributor

kannon92 commented Oct 12, 2024

Really, we should have done this before it was enabled (feature gate) by default. A lesson to learn perhaps.

We have discussed this many times for this feature. We decided to drop UnlimitedSwap to help mitigate this issue, as implementing swap eviction was considered difficult and risky for a KEP that has been in active development for many years. But it seems that the requirements changed right before we promoted to stable. We know that eviction is important, but we were thinking that "Basic Swap enablement" could go in first and then we could refine over time.

This sorta brings up a philosophical difference on what stable means in this context. We realize that Swap is an advanced functionality that requires kernel tuning for best practices. The tuning depends on OS swap settings, OS, method of setting up swap (file-based, memory, separate disk, etc), disk io speeds and other items. I am not clear if we can ever answer all these questions with a hardware/OS agnostic project.

I still think this KEP is meant for advanced users who are comfortable with this or have a need for this. I don't think we want to encourage swap on every node at the moment. It is why we changed the default to be NoSwap as many users are already using k8s on swap enabled nodes (kind/k3s/etc). Our hope was to have a better answer than "that is not supported".

@SergeyKanzhelev
Member

@SergeyKanzhelev also pointed out that there isn't enough user feedback on the feature. Sergey, can you share more on this?

I am worried that we are locking ourselves into the resource calculation specified here: https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2400-node-swap#steps-to-calculate-swap-limit and we have very little feedback on it as far as I know. The calculation we use today has some shortcomings:

Efficiency:

  • Guaranteed pods do not use swap, but they are still counted in the proportional limit calculation, so a portion of the swap file will never be used. With a swap file sized at 2x RAM this may be reasonable. With smaller swap files, it may confuse users and complicate node configuration.
  • Nodes are rarely binpacked well, so something like 20% of the swap file will be consistently unused.
  • When a customer uses nodes for "very burstable Pods" (Jupyter notebooks?), relying on a small "request" to calculate the proportion means every "very burstable Pod" gets a very small swap space to use. I wonder if we got any usage like this.
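
For reference, the proportional calculation under discussion can be sketched as follows. This is a minimal illustration that assumes the formula from the linked KEP section (container memory request divided by node memory, multiplied by the swap available to pods). The function name, the integer rounding, and the node sizes are hypothetical and are not taken from the kubelet source.

```python
GiB = 1024 ** 3
MiB = 1024 ** 2


def container_swap_limit(container_memory_request: int,
                         node_total_memory: int,
                         total_pods_swap_available: int) -> int:
    """Return a proportional swap limit in bytes (rounded down).

    Assumed formula:
        (container_memory_request / node_total_memory) * total_pods_swap_available
    """
    if node_total_memory <= 0:
        raise ValueError("node_total_memory must be positive")
    # Multiply before dividing so integer arithmetic keeps full precision.
    return container_memory_request * total_pods_swap_available // node_total_memory


# Hypothetical node: 64 GiB of RAM and 16 GiB of swap.
node_memory = 64 * GiB
node_swap = 16 * GiB

small_request_limit = container_swap_limit(64 * MiB, node_memory, node_swap)
large_request_limit = container_swap_limit(8 * GiB, node_memory, node_swap)
```

With these assumed numbers, a container requesting 64 MiB is limited to 16 MiB of swap, while one requesting 8 GiB gets 2 GiB. That gap is the "very burstable Pod" concern in the last bullet.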

Security:

  • The only way to disable swap on a Pod is to make it guaranteed. I wonder if any security-conscious company tried swap with encryption enabled and a need to set some containers as non-swappable, while keeping them burstable (Like a sidecar that downloads critical certificates).

Future proof:

  • If we decide we want to explicitly set limits for specific pods, we will not be able to keep the proportional formula any longer. So a single Pod with a configured limit on the node will result in potentially very different allocations.

Those are mostly off the top of my head. I will think more on it. And I am very open to hearing feedback on real-life usage. Memory swap support is such a great feature!

@sftim
Contributor

sftim commented Oct 12, 2024

I still think this KEP is meant for advanced users who are comfortable with this or have a need for this. I don't think we want to encourage swap on every node at the moment. It is why we changed the default to be NoSwap as many users are already using k8s on swap enabled nodes (kind/k3s/etc). Our hope was to have a better answer than "that is not supported".

We could declare that using anything other than NoSwap is still beta, but nonetheless enable the code path without a feature gate, and make a declaration that running a Linux node with some pagefile active is a supported configuration. We can make that partial declaration of graduation without changing any other feature gates.

@kannon92
Contributor

@SergeyKanzhelev

Efficiency:

Guaranteed pods do not use swap, but they are still counted in the proportional limit calculation, so a portion of the swap file will never be used. With a swap file sized at 2x RAM this may be reasonable. With smaller swap files, it may confuse users and complicate node configuration.
Nodes are rarely binpacked well, so something like 20% of the swap file will be consistently unused.
When a customer uses nodes for "very burstable Pods" (Jupyter notebooks?), relying on a small "request" to calculate the proportion means every "very burstable Pod" gets a very small swap space to use. I wonder if we got any usage like this.

I agree that this isn't ideal. In the non-goals of this KEP we called out that workload-level specification could be done as a follow-up. We wanted to be conservative with this KEP because we had to start somewhere. We achieved a lot with this KEP, but we can't get everything we want. Our hope was to focus on getting this KEP to stable as is, so that we can iterate on future improvements to make swap more useful for workloads. At the moment, this KEP is too limited for workload swap (and we are too far along to propose new APIs in it), but that was by design from the start.

the only way to disable swap on a Pod is to make it guaranteed. I wonder if any security-conscious company tried swap with encryption enabled and a need to set some containers as non-swappable, while keeping them burstable (Like a sidecar that downloads critical certificates).

By keeping this KEP in beta, we as a k8s org are saying that this is not ready for people to try. From what I can tell, most managed services are waiting for this KEP to go to stable to consider enabling this feature (ref: aws/containers-roadmap#1714).

@iholder101 and I have been adamant that we would propose a future KEP to make swap more useful for workloads. Right now, this KEP is about enablement so that we can iterate on future work. I'd like to be able to have more agile development for KEPs.

My proposal for a future KEP could include:

  • user opt in / opt out of swap via pod spec
  • swap for all QoS
  • maybe swap based limits so pod owners have control of how much swap to use
  • Node UX improvements on swap capacity (kubectl get nodes should display information about swap memory)
  • Swap aware eviction

We are stuck with what we can do with this current KEP because AFAIK we can't propose new APIs to address these areas with a KEP this far along.

@enp0s3

enp0s3 commented Oct 13, 2024

Security:

  • The only way to disable swap on a Pod is to make it guaranteed. I wonder if any security-conscious company tried swap with encryption enabled and a need to set some containers as non-swappable, while keeping them burstable (Like a sidecar that downloads critical certificates).

@SergeyKanzhelev Hi, swap is disabled for best-effort pods as well.
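
The point above (no swap for Guaranteed or BestEffort pods, so only Burstable pods are eligible under LimitedSwap) can be illustrated with a small sketch. It uses a simplified version of the Kubernetes QoS rules that looks only at `cpu` and `memory`. The dict-based pod shape and the helper names `qos_class` and `may_use_swap` are hypothetical and exist only for this example.

```python
def qos_class(containers):
    """Classify a pod from its containers' resource settings.

    `containers` is a list of dicts with optional "requests" and "limits"
    dicts keyed by "cpu" and "memory" (a simplified, assumed shape).
    """
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # A missing request defaults to the limit.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


def may_use_swap(containers, swap_behavior="LimitedSwap"):
    """Only Burstable pods are eligible, and only under LimitedSwap."""
    return swap_behavior == "LimitedSwap" and qos_class(containers) == "Burstable"


sidecar_pod = [{"requests": {"memory": "64Mi"}, "limits": {"memory": "256Mi"}}]
guaranteed_pod = [{"requests": {"cpu": "1", "memory": "1Gi"},
                   "limits": {"cpu": "1", "memory": "1Gi"}}]
best_effort_pod = [{}]
```

Under these assumptions, a burstable sidecar cannot opt out of swap without changing its QoS class, which is the security concern raised in the quoted bullet.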

@enp0s3

enp0s3 commented Oct 13, 2024

Efficiency:

  • Guaranteed pods do not use swap, but they are still counted in the proportional limit calculation, so a portion of the swap file will never be used. With a swap file sized at 2x RAM this may be reasonable. With smaller swap files, it may confuse users and complicate node configuration.

@SergeyKanzhelev But the guaranteed pod can die and a burstable pod can take its place instead. In that case we can't assume it's always true that the burstable pods can share the unused swap slice of the guaranteed ones.

@SergeyKanzhelev
Member

@SergeyKanzhelev But the guaranteed pod can die and a burstable pod can take its place instead. In that case we can't assume it's always true that the burstable pods can share the unused swap slice of the guaranteed ones.

@enp0s3 exactly, this is why it is calculated this way. And I wonder if this is right and we have any feedback on it.

@SergeyKanzhelev Hi, swap is disabled for best-effort pods as well.

Yes, I was wondering if we have any strong signal that we need to do more. The assumption we went with is that a cluster admin can enable it and all pods will be affected. I wonder if this is a serious blocker for adoption. So far I am only aware of setups where it is enabled on a very targeted set of nodes.

we called out that workload level specification could be done as a followup.

@kannon92 it comes to the last item. I don't see a clear path to start setting swap limits per workload without breaking existing calculated proportional-allocation behavior. Maybe we can craft the high level picture how it will be enabled later if we will decide to do it.

we can't propose new APIs to address these areas with a KEP this far along.

We should discuss what those proposals would look like. Beta typically gives a reasonable sense of quality so people can try it. Also, in beta it is still OK for us to break contracts. Right now I feel like the contract we propose works for a subset of use cases and is not well tested outside of those. And I also have a hard time picturing how we can extend this API without breaking what this KEP currently proposes.

@iholder101
Contributor Author

iholder101 commented Oct 14, 2024

I don't see a clear path to start setting swap limits per workload without breaking existing calculated proportional-allocation behavior.

have hard time picturing how we can extend this API without breaking what this KEP currently proposes.

@SergeyKanzhelev Today we support two "swap behaviors", NoSwap and LimitedSwap. The currently calculated proportional-allocation limit is used by LimitedSwap. What I had in mind is that in the future we can add other swap behaviors which will be more flexible and customizable.

I don't believe we're locked into the current swap limit formula that is used by LimitedSwap. In fact, we can even deprecate it in the future if we come up with something better and see it becoming less useful. We have to start somewhere, and we commit to incrementally improving swap, which I believe will be the most beneficial approach for users who are thirsty to use swap in production, even in a limited way as a starting point.

IMHO the first and most important follow-up KEPs for swap should revolve around:

  • Swap aware evictions.
  • Better swap behaviors, hence improved customizability.

@jabdoa2

jabdoa2 commented Oct 14, 2024

I don't see a clear path to start setting swap limits per workload without breaking existing calculated proportional-allocation behavior.

have hard time picturing how we can extend this API without breaking what this KEP currently proposes.

@SergeyKanzhelev Today we support two "swap behaviors", NoSwap and LimitedSwap. The currently calculated proportional-allocation limit is used by LimitedSwap. What I had in mind is that in the future we can add other swap behaviors which will be more flexible and customizable.

I don't believe we're locked into the current swap limit formula that is used by LimitedSwap. In fact, we can even deprecate it in the future if we come up with something better and see it becoming less useful. We have to start somewhere and commit to incrementally improving swap, which I believe will be the most beneficial approach for users.

As a user, I find the current model a very solid starting point. It's hard (next to impossible) to exceed swap. Basically a very conservative choice. For us it fits our use cases (production workload and CI workload) pretty well. We agree that there could be more options. Especially for pods with a very low reservation it would be nice to allow slightly more swap. However, that is really a nice-to-have, and the current model already works pretty well. If you reserve 10MB and use 1GB it is kind of expected that you eventually get evicted (which could be prevented with "reserved" swap). I also don't see anything preventing different future strategies.

@sftim
Contributor

sftim commented Oct 14, 2024

Should we surface a node's configured swap behavior as a node label?

@iholder101
Contributor Author

iholder101 commented Oct 14, 2024

@SergeyKanzhelev

Efficiency:

  • Guaranteed pods do not use swap, but they are still counted in the proportional limit calculation, so a portion of the swap file will never be used. With a swap file sized at 2x RAM this may be reasonable. With smaller swap files, it may confuse users and complicate node configuration.
  • Nodes are rarely binpacked well, so something like 20% of the swap file will be consistently unused.
  • When a customer uses nodes for "very burstable Pods" (Jupyter notebooks?), relying on a small "request" to calculate the proportion means every "very burstable Pod" gets a very small swap space to use. I wonder if we got any usage like this.

The fact that some of the swap space remains untouched by Kubernetes workloads is true but carries the benefit of leaving a small amount to be used by system daemons and services running outside of Kubelet's scope, or to even be used by Kubelet itself. That's a very conservative approach to start with.
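
As a rough illustration of how much swap stays untouched, the sketch below sums the proportional limits of Burstable pods only and subtracts that from the node's swap. It assumes the proportional formula from the KEP and that Guaranteed and BestEffort pods get no swap. The node size, the pod mix, and the function name are hypothetical.

```python
GiB = 1024 ** 3


def swap_unreachable_by_pods(pods, node_total_memory, node_swap):
    """Return the bytes of swap that no pod can be assigned.

    `pods` is a list of (qos_class, memory_request_bytes) tuples.
    Only Burstable pods receive a proportional share in this sketch.
    """
    assigned = sum(
        request * node_swap // node_total_memory
        for qos, request in pods
        if qos == "Burstable"
    )
    return node_swap - assigned


# Hypothetical node: 64 GiB of RAM, 16 GiB of swap, not fully binpacked.
example_pods = [
    ("Guaranteed", 16 * GiB),
    ("Burstable", 16 * GiB),
    ("Burstable", 8 * GiB),
]
leftover = swap_unreachable_by_pods(example_pods, 64 * GiB, 16 * GiB)
```

In this made-up case the Burstable pods can reach 6 GiB of swap in total, leaving 10 GiB for system daemons, the kubelet, or nothing at all.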

I agree that we should make this more customizable, but by default this approach is sensible IMO and is like that by design.

Security:

  • The only way to disable swap on a Pod is to make it guaranteed. I wonder if any security-conscious company tried swap with encryption enabled and a need to set some containers as non-swappable, while keeping them burstable (Like a sidecar that downloads critical certificates).

Bear in mind that as mentioned here https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2400-node-swap/README.md#security-risk:

end user may decide to disable swap completely for a Pod or a container in
beta 1 by making Pod guaranteed or set request == limit for a container

This will also have to be mentioned in the documentation.

Future proof:

  • If we decide we want to explicitly set limits for specific pods, we will not be able to keep the proportional formula any longer. So a single Pod with a configured limit on the node will result in potentially very different allocations.

Those are mostly off the top of my head. I will think more on it. And I am very open to hearing feedback on real-life usage. Memory swap support is such a great feature!

See #4701 (comment).

@kannon92
Contributor

Thank you all for the passionate debate on this feature.

We decided to meet and figure out the next steps for this feature.

The general consensus is that we want to find a way to support swap based eviction to protect the node from instability. The eviction manager should be able to step in and evict. We do not want to rely on the oomkiller and we want to teach the eviction manager to be aware of swap.

From this meeting we have the following action items for this feature:

  • We will make the eviction manager aware of swap and take action before swap is exhausted.
  • We will mention future work needed in this KEP.
  • We will update our existing beta docs.
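
To make the first action item more concrete, here is a minimal sketch of what a swap-aware signal could look like: it parses `/proc/meminfo`-style text and reports pressure when free swap drops below a fraction of total swap. This is not the kubelet eviction manager. The function names and the 10% threshold are hypothetical, and the thread does not specify how the real signal will be defined.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values.

    Most entries are reported in kB; the unit suffix is dropped here.
    """
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def swap_pressure(meminfo, min_free_fraction=0.10):
    """Return True when free swap is below the (assumed) threshold."""
    total = meminfo.get("SwapTotal", 0)
    if total == 0:
        # No swap configured, so there is no swap pressure to report.
        return False
    return meminfo.get("SwapFree", 0) / total < min_free_fraction


# Hand-written sample instead of reading the real file, so the sketch
# runs on any platform.
sample = """MemTotal:       16000000 kB
SwapTotal:       4000000 kB
SwapFree:         200000 kB
"""
under_pressure = swap_pressure(parse_meminfo(sample))
```

On a Linux node the same functions could be fed the contents of `/proc/meminfo`; which pods to evict once the signal fires is a separate question that this sketch does not address.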

For 1.32, this means that we will not proceed with stable promotion. The above items will be required to promote this to stable.

@Siegfriedk

@kannon92 slightly unfortunate for our use case. Our CI/CD system wouldn't mind the limitations and Beta is an issue as we can't 'push' for proper support on Beta features.

Updates regarding the eviction manager will still be part of this KEP?

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: deads2k, iholder101
Once this PR has been reviewed and has the lgtm label, please ask for approval from dchen1107. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
