Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP-2400: Update swap KEP for 1.23 beta #2858

Merged
merged 4 commits into from
Sep 8, 2021

Conversation

ehashman
Copy link
Member

  • One-line PR description: Update node swap KEP to 1.23 beta
  • Other comments:

/sig node

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. sig/node Categorizes an issue or PR as relevant to SIG Node. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Aug 12, 2021
@k8s-ci-robot k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label Aug 12, 2021
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Aug 12, 2021
@ehashman ehashman changed the title [WIP] KEP-2400: Update swap KEP for 1.23 beta KEP-2400: Update swap KEP for 1.23 beta Aug 12, 2021
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 12, 2021
@ehashman
Copy link
Member Author

/assign @deads2k
for PRR

/assign @derekwaynecarr @mrunalp
for Node review

Copy link
Contributor

@sftim sftim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because graduation to beta changes the implied level of support for running with swap, I recommend that we carefully review Kubernetes documentation as part of this graduation work and look for areas that might need fixing.

One likely area is the reliance on tmpfs for information at rest. When swap was unsupported / alpha, cluster operators and app developers could assume that memory backed emptyDir volumes were never written to durable storage, and that the same would be true for Secret data used in connection with ServiceAccounts etc.

If tmpfs might be written to swap (which Linux readily does) then cluster operators may need to be warned not to enable swap. We didn't need (as much) to warn them when swap was unsupported because it was documented at a sort of “enable this and expect undefined behavior” level.

Another similar concern comes with the use of mlock() vs. not bothering. Apps that haven't worried about calling mlock() might need to think again.

Comment on lines +594 to +595
If a new node with swap memory fails to come online, it will not impact any
running components.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If someone does an in-place upgrade on a node (stopping kubelet, starting a new kubelet on the same server), can that fail? How?

If it could fail then an upgrade might for example take out the nodes where the control plane ought to be running as static pods.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The in-place upgrade would not fail unless swap access was added while the node was still online. Normally turning swap on and off at runtime isn't considered best practice for a production environment, I'd expect a node to be reimaged and rebooted, but I can mention it.

keps/sig-node/2400-node-swap/README.md Outdated Show resolved Hide resolved
@@ -622,12 +637,21 @@ checking if there are objects with field X set) may be a last resort. Avoid
logs or events for this purpose.
-->

KubeletConfiguration has set `failOnSwap: false`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can I tell two nodes apart:

  • one has failOnSwap: false and memorySwap set to swapBehavior: LimitedSwap, with the NodeSwap feature gate enabled
  • another has failOnSwap: false and the NodeSwap feature gate disabled

via the kubernetes API? If so, I'd mention how to distinguish them.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if that is bubbled up to the API Server. The purpose of this question is for beta, when the feature gate is defaulted on, so you can't rely on it being turned on as a sign that the feature is in use. We might be able to check if swapBehavior is explicitly set, but empty string is equivalent to LimitedSwap.

Realistically, this KEP iterates on the existed unsupported configuration with failOnSwap: false. Because it was previously unsupported, I am assuming here that a production environment would not have it set if it were not using this feature.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once this is beta we should assume that people have this feature gate set to a value of their choice. For alpha it was different: you need to be a little more brave to try it and most clusters run with the default, ie feature enabled.

The switch from unsupported to “mostly supported, but it's still beta” is why I'm asking about observability.

@ehashman ehashman mentioned this pull request Sep 2, 2021
60 tasks
Fill out remaining beta PRR questions, add test plans
@ehashman
Copy link
Member Author

ehashman commented Sep 3, 2021

Because graduation to beta changes the implied level of support for running with swap, I recommend that we carefully review Kubernetes documentation as part of this graduation work and look for areas that might need fixing.

I think this is fair feedback.

Note that (at least for CRI-O, I think) the CRI changes from 1.22 still haven't landed, so we can't test alpha swap support in a meaningful way in CI until that's available. This is an ambitious beta so it might slip to 1.24, but I want to ensure we're at least making forward progress this release, e.g. with adding pod-level cgroup limits and gathering data from real-world testing.

@sftim
Copy link
Contributor

sftim commented Sep 7, 2021

Once testing with an updated container runtime is possible, we could graduate it to beta-but-disabled-by-default. Even that needs the docs review, though. Improving the feature but leaving it alpha is fine of course and doesn't need the docs review I mentioned.

@sftim
Copy link
Contributor

sftim commented Sep 7, 2021

@ehashman / SIG Node, would you be willing to hold off that graduation to beta, pending work on a [not yet filed] issue in k/website to identify and update pages that implicitly assume swap is disabled?

###### What specific metrics should inform a rollback?

<!--
What signals should users be paying attention to when the feature is young
that might indicate a serious problem?
-->

Workload churn or performance degradations on nodes. The metrics will be
application/use-case specific, but we can provide some suggestions.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like the spot to provide those suggestions.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but we don't have them yet (comment below).

###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

<!--
Pick one more of these and delete the rest.
-->

TBD. We will determine a set of metrics as part of beta graduation. We will
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The beta criteria (including this metric) is listed as tentative. Can you commit to the beta criteria about metrics?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Figuring out what these metrics are is blocking for beta graduation, but we need end-user data/a release cycle of testing to determine them, which will happen during the dev cycle. The graduation of swap is a little different from a regular Kubernetes feature in that we need to be able to look at metrics node-wide.

This is why it might be a stretch to graduate this for beta this release, but I don't want to block the other required dev work for beta just because we don't have this list yet.

@deads2k
Copy link
Contributor

deads2k commented Sep 8, 2021

The PRR updates look good

/approve

remember to get up with sig-node to approve the graduation to beta as a whole.

@ehashman
Copy link
Member Author

ehashman commented Sep 8, 2021

/label tide/merge-method-squash

@k8s-ci-robot k8s-ci-robot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Sep 8, 2021
Copy link
Member

@derekwaynecarr derekwaynecarr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one minor request to add Swap to the MemoryStats api to track usage.

keps/sig-node/2400-node-swap/README.md Show resolved Hide resolved
@@ -426,6 +430,7 @@ _(Tentative.)_
detects on the host.
- Consider introducing new configuration modes for swap, such as a node-wide
swap limit for workloads.
- Add swap memory to the Kubelet stats api.
Copy link
Member Author

@ehashman ehashman Sep 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@derekwaynecarr suggested this addition for beta.

@dashpole WDYT?

See also #2858 (comment)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense to me. I don't remember what swap metrics are available for cgroups v1 vs v2, but I'd want to at least be able to tell if swap is causing problems at the node level. It may also be nice to have swap metrics at the pod/container level if the available metrics can tell me if swapping is hurting my application's performance in some way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dashpole cgroup v2 has memory.swap.current which exists on non-root cgroups which is total amount of swap currently being used by the cgroup and its descendants.

It appears to already be supported in cAdvisor, we just did not include it in k8s API for MemoryStats as it was not yet supported to be on.

see: https://github.com/google/cadvisor/blob/ef7e64f9efab1257e297d7af339e94bb016cf221/container/libcontainer/handler.go#L800

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like cgroups v1 has an equivalent, so we should be all set in that regard

Copy link
Member

@derekwaynecarr derekwaynecarr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve
/lgtm

Thanks for making the updates, and I am happy with this plan. I agree with @mrunalp that we will learn a lot in real world workload testing, and so this should be fun!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 8, 2021
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, derekwaynecarr, ehashman

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 8, 2021
@k8s-ci-robot k8s-ci-robot merged commit 5cfe5f2 into kubernetes:master Sep 8, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.23 milestone Sep 8, 2021
ravisantoshgudimetla pushed a commit to ravisantoshgudimetla/enhancements that referenced this pull request Sep 9, 2021
* Update swap KEP for 1.23 beta

Fill out remaining beta PRR questions, add test plans

* Address PRR feedback

* Add test plan note for eviction manager/MemoryPressure

* Add swap memory to Kubelet stats API
rikatz pushed a commit to rikatz/enhancements that referenced this pull request Feb 1, 2022
* Update swap KEP for 1.23 beta

Fill out remaining beta PRR questions, add test plans

* Address PRR feedback

* Add test plan note for eviction manager/MemoryPressure

* Add swap memory to Kubelet stats API
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/node Categorizes an issue or PR as relevant to SIG Node. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants