-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[KEP-2400] - Update KEP with findings from beta2 #4401
[KEP-2400] - Update KEP with findings from beta2 #4401
Conversation
af16a2d
to
d97e2bf
Compare
@@ -1059,6 +1169,7 @@ nodes that do not use swap memory. | |||
- **2021-01-05:** Initial design discussion document for swap support and use cases. | |||
- **2021-04-05:** Alpha KEP drafted for initial node-level swap support and implementation (KEP-2400). | |||
- **2021-08-09:** New in Kubernetes v1.22: alpha support for using swap memory: https://kubernetes.io/blog/2021/08/09/run-nodes-with-swap-alpha/ | |||
- **2024-01-12:** Updates to Beta2. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to add a reference to the PR that added swap support in the stats summary? kubernetes/kubernetes#118865
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although, I see that section only has links to issues, not PRs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This section does need some love. I don't really know the dates but I guess I can find PRs some I can fill this in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the section with what I could find. There are alot of PRs and I decided to focus on promotion to beta1. PTAL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/assign @dchen1107 @mrunalp @SergeyKanzhelev Since this is staying in beta, I am not sure if we need PRR reviews/approvals at the moment. |
This looks fine to me overall. |
Thank you for updating PRR as you go. The PRR changes lgtm |
- Add [swap specific tests](https://github.com/kubernetes/kubernetes/issues/120798) such as, handling the usage of | ||
swap during container restart boundaries for writes to tmpfs (which may require pod cgroup change beyond what | ||
container runtime will do at (container cgroup boundary). | ||
- Fix flaking/failing swap node e2e jobs. | ||
- Address eviction related [issue](https://github.com/kubernetes/kubernetes/issues/120800) in swap implementation. | ||
- Add `NoSwap` as the default setting. | ||
- Remove `UnlimitedSwap` as a supported option. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1000
@@ -608,12 +714,18 @@ Here are specific improvements to be made: | |||
- Improve coverage for appropriate scenarios in testgrid. | |||
|
|||
#### Beta 2 | |||
- Publish a Kubernetes doc page encoring user to use encrypted swap if they wish to enable this feature. | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anything for the debuggability enhancement since enabling the feature might affect the performance of the node. I know we introduced swap metrics, but what about kubectl describe node?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW we are still exploring what the best way of monitoring swap is. If it's direct metrics - swap traffic, or indirect, such as *wait times.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll put this as something to look into @dchen1107! The issue with describe node is that I think it gives capacity for swap (ie how much swap the node actually has) but a user may get confused because they are only allowed to use up to computed limit on swap for Burstable QoS. So its a bit misleading to add a capacity when you can't really use it.
I think we should add some way to detect if swap is enabled on a node. I talked with nfd team about potentially adding a way to detect if swap is on a node so we can add useful labels for nodes for swap.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the node-level swap capacity would still be a useful information even if the pods may not be able to consume it. That's the same for other resources on the node too.
Q: Is there a good way to expose per-pod swap usage?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A quick way to examine whether system-critical daemons are using swap would also be helpful
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A quick way to examine whether system-critical daemons are using swap would also be helpful
I don't have a great way to do this. @fabiand any ideas here?
I would look at free
and then see what is using swap.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the node-level swap capacity would still be a useful information even if the pods may not be able to consume it. That's the same for other resources on the node too.
I've been asked that a few times so I think we can add this to the node information when doing kubectl describe node
.
Q: Is there a good way to expose per-pod swap usage?
I was actually thinking that a crictl stats or something like that may be useful here.
I think one could view the summary api using kubectl --raw
and view the swap stats.
It could maybe be useful to expose information at a lower level tool if that is of interest.
We would look into the container and view memory.swap.current
to see if swap was being used. The scoping was a bit confusing to me but exec into the pod and viewing the memory.swap.current
seems to be the best way to get swap information.
@iholder101 any other ways you found useful to look at per-pod swap usage?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been asked that a few times so I think we can add this to the node information when doing kubectl describe node
I think it makes sense. I'll get to it.
@iholder101 any other ways you found useful to look at per-pod swap usage?
What about using kubectl top
to get this information? I can dive deeper into it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A quick way to examine whether system-critical daemons are using swap would also be helpful
Critical daemons can be provided in two areas
- node level (part of os)
- as containers
(1) is easy to monitor: system.slice contains all these services. system.slice should also be configured to not use swap (part of the kep)
(2) is more difficult i.e. rook or other CNI/CSI projects will have critical pods.
Monitoring and configuration of those is more difficult.
If we assume that pods are correctly assigned to PriorityClasses, then monitoring can be improve by displaying swap consumption by priority class.
Further - unsure about it - we could consider to implictily control swap for workloads based on priority class. I.e. disable for node- and cluster-critical workloads.
Addition thoughts:
I've often seen that pods from (2) do run as burstable, making them eligible to swap.
We either need to find a way to (a) let pods opt-out of swap independent of the qos class or (b) make pods to opt-in to swap - regardless of the qos class (burstable pods can also benefit from swap as a safety mechanism).
The broad decision we should think about is: Do we want opt-in or opt-out for workloads.
I've been asked that a few times so I think we can add this to the node information when doing kubectl describe node
@iholder101 +1 in adding to kubelet and then kubectl. Let's include two metrics: One around amount of swap, and the other on amount of swap traffic. The latter is worse than the first.
/lgtm with a minor comment which can be added later. |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: dchen1107, kannon92 The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/lgtm |
Update Swap KEP with findings from beta2 testing.
Our recommendation is to implement
MemorySwap = NoSwap
and to let this feature sit in beta2. We don't want to promote this to GA at this time but we realize that someone guard rails are necessary for this feature.In this KEP, we propose some best practices and allow for opt-in support for this feature regardless of the feature toggle.