[KEP-2400] Node swap updates, GA criterias and clarifications #4701
base: master
Please update https://github.com/kubernetes/enhancements/blob/master/keps/prod-readiness/sig-node/2400.yaml and fill in the missing bits of the PRR questionnaire. |
Thanks @deads2k!
I see you're the assigned approver for alpha/beta.
Done! PTAL :) |
/retitle [KEP-2400] Node swap ppdates, GA criterias and clarifications |
D'oh /retitle [KEP-2400] Node swap updates, GA criterias and clarifications |
Thanks for the PR.
Here's a mix of feedback; I hope it is all useful.
We have discussed this many times over the life of this feature. We decided to drop UnlimitedSwap to help mitigate this issue, since implementing swap eviction was considered difficult and risky for a KEP that has been in active development for many years. But it seems that the requirements changed right before we promoted to stable. We know that eviction is important, but we were thinking that "Basic Swap enablement" could go in first and then we can refine over time. This sort of brings up a philosophical difference on what stable means in this context. We realize that swap is advanced functionality that requires kernel tuning to follow best practices. The tuning depends on OS swap settings, the OS, the method of setting up swap (file-based, memory, separate disk, etc.), disk I/O speeds, and other items. I am not clear if we can ever answer all these questions with a hardware/OS-agnostic project. I still think this KEP is meant for advanced users who are comfortable with this or have a need for it. I don't think we want to encourage swap on every node at the moment. That is why we changed the default to NoSwap, as many users are already using k8s on swap-enabled nodes (kind/k3s/etc). Our hope was to have a better answer than "that is not supported". |
I am worried that we are locking ourselves into the resource calculation specified here: https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2400-node-swap#steps-to-calculate-swap-limit and we have very little feedback on it as far as I know. The calculation we use today has some shortcomings: Efficiency:
Security:
Future proof:
Those are mostly off the top of my head. I will think more on it. And I am very open to hearing feedback on real-life usage. Memory swap support is such a great feature! |
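For readers following the thread, the proportional calculation linked above can be sketched roughly as below. This is a minimal sketch, assuming the limit is the container's memory request divided by the node's physical memory, multiplied by the swap made available to pods. The function and parameter names are hypothetical and are not kubelet identifiers, and the clamping of oversized requests is a defensive choice of this sketch only.

```go
package main

import "fmt"

// swapLimitBytes is a hypothetical helper, not kubelet code.
// It applies the container's share of node memory to the swap
// that is available to pods:
//   limit = (memoryRequest / nodeMemory) * podSwapAvailable
func swapLimitBytes(memoryRequest, nodeMemory, podSwapAvailable int64) int64 {
	// Anything non-positive yields no swap in this sketch.
	if memoryRequest <= 0 || nodeMemory <= 0 || podSwapAvailable <= 0 {
		return 0
	}
	// Assumption of this sketch: a request larger than the node is clamped.
	if memoryRequest > nodeMemory {
		memoryRequest = nodeMemory
	}
	proportion := float64(memoryRequest) / float64(nodeMemory)
	return int64(proportion * float64(podSwapAvailable))
}

func main() {
	const gib = int64(1) << 30
	// A 1 GiB request on a 32 GiB node with 8 GiB of swap for pods.
	limit := swapLimitBytes(1*gib, 32*gib, 8*gib)
	fmt.Printf("swap limit: %d MiB\n", limit>>20) // prints "swap limit: 256 MiB"
}
```

Under the same assumptions, a container requesting only 10 MiB on that node would be limited to 2.5 MiB of swap, which illustrates why small requests get very little swap under a purely proportional scheme.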
We could declare that using anything other than NoSwap is still beta, but nonetheless enable the code path without a feature gate, and make a declaration that running a Linux node with some pagefile active is a supported configuration. We can make that partial declaration of graduation without changing any other feature gates. |
I agree that this isn't ideal. In the non-goals of this KEP we called out that workload-level specification could be done as a follow-up. We wanted to be conservative with this KEP because we had to start somewhere. We achieved a lot in this KEP, but we can't get everything we want. Our hope was that we focus on getting this KEP to stable as is, so that we can iterate on future improvements to make swap more useful for workloads. At the moment, this KEP is too limited for workload swap (and we are too far along to propose new APIs in this KEP), but that was by design from the start.
By keeping this KEP in beta, we as a k8s org are saying that this is not ready for people to try. From what I can tell, most managed services are waiting for this KEP to go to stable before they consider enabling this feature (ref: aws/containers-roadmap#1714). @iholder101 and I have been adamant that we would propose a future KEP to make swap more useful for workloads. Right now, this KEP is about enablement so that we can iterate on future work. I'd like to be able to have more agile development for KEPs. My proposal for a future KEP could include:
We are stuck with what we can do with this current KEP because AFAIK we can't propose new APIs to address these areas with a KEP this far along. |
@SergeyKanzhelev Hi, swap is disabled for best-effort pods as well. |
@SergeyKanzhelev But the guaranteed pod can die and a burstable pod can take its place instead. In that case we can't assume it's always true that the burstable pods can share the unused swap slice of the guaranteed ones. |
@enp0s3 exactly, this is why it is calculated this way. And I wonder if this is right and we have any feedback on it.
Yes, I was wondering if we got any strong signal that we need to do more. The assumption we went with is that the cluster admin can enable it and all pods will be affected. I wonder if this is a serious blocker for adoption. So far I am only aware of setups where it is enabled on a very targeted set of nodes.
@kannon92 it comes to the last item. I don't see a clear path to start setting swap limits per workload without breaking existing calculated proportional-allocation behavior. Maybe we can craft the high level picture how it will be enabled later if we will decide to do it.
We should discuss what those proposals would look like. Beta typically gives a reasonable sense of quality so people can try it. Also, in beta it is still OK for us to break contracts. Right now I feel like the contract we propose works for a subset of use cases and is not well tested outside of those. And I also have a hard time picturing how we can extend this API without breaking what this KEP currently proposes. |
@SergeyKanzhelev Today we support two "swap behaviors", I don't believe we're locked in to the current swap limit formula that is used by IMHO the first and most important follow-up KEPs for swap should revolve around:
|
As a user, the current model is a very solid starting point. It's hard (next to impossible) to exceed swap. Basically a very conservative choice. For us it fits our use cases (production workload and CI workload) pretty well. We agree that there could be more options. Especially for pods with a very low reservation it would be nice to allow slightly more swap. However, that is really a nice-to-have, and it already works pretty well. If you reserve 10MB and use 1GB it is kind of expected that you eventually get evicted (which could be prevented with "reserved" swap). I also don't see anything preventing different future strategies. |
Should we surface a node's configured swap behavior as a node label? |
The fact that some of the swap space remains untouched by Kubernetes workloads is true but carries the benefit of leaving a small amount to be used by system daemons and services running outside of Kubelet's scope, or to even be used by Kubelet itself. That's a very conservative approach to start with. I agree that we should make this more customizable, but by default this approach is sensible IMO and is like that by design.
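To make the "untouched swap" point concrete, here is a small sketch of how much pod swap stays unassigned on a node. It assumes, as discussed earlier in this thread, that only Burstable pods receive a swap limit and that the limit is proportional to the memory request. The types and names (`pod`, `qosClass`, `unusedPodSwap`) are hypothetical and are not kubelet code.

```go
package main

import "fmt"

// qosClass and pod are simplified stand-ins for illustration only.
type qosClass string

const (
	guaranteed qosClass = "Guaranteed"
	burstable  qosClass = "Burstable"
	bestEffort qosClass = "BestEffort"
)

type pod struct {
	name          string
	qos           qosClass
	memoryRequest int64 // bytes
}

// unusedPodSwap returns the swap that no pod is allowed to touch, assuming
// only Burstable pods get a limit of (request / nodeMemory) * podSwap.
func unusedPodSwap(pods []pod, nodeMemory, podSwap int64) int64 {
	var assigned int64
	for _, p := range pods {
		if p.qos != burstable {
			continue // Guaranteed and BestEffort pods get no swap here.
		}
		share := float64(p.memoryRequest) / float64(nodeMemory)
		assigned += int64(share * float64(podSwap))
	}
	return podSwap - assigned
}

func main() {
	const gib = int64(1) << 30
	pods := []pod{
		{name: "db", qos: guaranteed, memoryRequest: 16 * gib},
		{name: "web", qos: burstable, memoryRequest: 8 * gib},
		{name: "batch", qos: bestEffort, memoryRequest: 0},
	}
	left := unusedPodSwap(pods, 32*gib, 8*gib)
	fmt.Printf("unassigned swap: %d GiB\n", left/gib) // prints "unassigned swap: 6 GiB"
}
```

In this example the Guaranteed pod's 16 GiB request effectively reserves a slice of swap that nothing else can use, which is the conservative behavior described above and also the efficiency concern raised earlier.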
Bear in mind that as mentioned here https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2400-node-swap/README.md#security-risk:
This will also have to be mentioned in the documentation.
See #4701 (comment). |
Thank you all for the passionate debate on this feature. We decided to meet and figure out the next steps for this feature. The general consensus is that we want to find a way to support swap based eviction to protect the node from instability. The eviction manager should be able to step in and evict. We do not want to rely on the oomkiller and we want to teach the eviction manager to be aware of swap. From this meeting we have the following action items for this feature:
For 1.32, this means that we will not proceed with stable promotion. The above items will be required to promote this to stable. |
@kannon92 slightly unfortunate for our use case. Our CI/CD system wouldn't mind the limitations and Beta is an issue as we can't 'push' for proper support on Beta features. Updates regarding the eviction manager will still be part of this KEP? |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: deads2k, iholder101 The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Signed-off-by: Itamar Holder <[email protected]>
…avior Signed-off-by: Itamar Holder <[email protected]>
Add updates, GA criterias and clarifications
This PR updates the KEP in the following ways:
Emphasize that this KEP is about basic swap enablement
The original KEP indicated that pod-level swap APIs are out of scope:
enhancements/keps/sig-node/2400-node-swap/README.md
Lines 163 to 166 in 155a949
enhancements/keps/sig-node/2400-node-swap/README.md
Lines 142 to 144 in 155a949
However, the lack of APIs and the implicit nature of the current implementation sometimes prompt suggestions to extend the API under this KEP.
This KEP focuses on basic swap enablement. Follow-up KEPs on several topics (e.g. customization, zram/zswap support, and more) will be introduced in the near future, so that each extension can be designed and implemented in a focused way.
To ensure we're on the same page, this topic was recently raised in a sig-node meeting. In this meeting there was a very broad consensus that this approach makes sense, especially since the NodeSwap feature is important to many different parties that want it to "just work".
This PR updates the KEP to emphasize this approach.
GA criteria
The PR adds GA criteria, alongside the intent to GA in version 1.32.
Make sure PRR is ready
Updates
Since the last KEP updates, many improvements were made and many concerns were addressed. For example:
This PR updates the KEP to reflect these updates.