KEP-2400: Graduate swap to Beta 1 #3957
What if the container hasn't requested anything? It also might or might not have a limit set. Let's define the behavior in this case.
Is the proposal to start with the proportional approach to avoid the API change?
Yes indeed.
Or more generally: to avoid letting the user configure swap, which is very difficult to do and might compromise the node.
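The proportional approach can be sketched as follows. This is illustrative only (the function name and exact formula are assumptions, not the kubelet's actual implementation): the container's swap allotment is derived from its memory request's share of node memory, so no new API field is exposed and users never configure swap directly.

```go
package main

import "fmt"

// swapLimit sketches the proportional approach: a container's swap allotment
// is its memory request's share of node memory, applied to the node's total
// swap. Illustrative only; not the kubelet's actual code.
func swapLimit(memoryRequestBytes, nodeMemoryBytes, nodeSwapBytes int64) int64 {
	if nodeMemoryBytes <= 0 {
		return 0 // no meaningful proportion without node memory capacity
	}
	fraction := float64(memoryRequestBytes) / float64(nodeMemoryBytes)
	return int64(fraction * float64(nodeSwapBytes))
}

func main() {
	// A container requesting 2Gi on a node with 16Gi of RAM and 8Gi of swap
	// would be allowed 1Gi of swap.
	fmt.Println(swapLimit(2<<30, 16<<30, 8<<30)) // 1073741824
}
```

Because the limit is derived entirely from the existing memory request, nothing new is added to the Pod API.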
In other words: kubeletconfig.CgroupRoot can't be a child of kubeletconfig.SystemReservedCgroups.
As an implementation note: this should probably be added to the kubelet config documentation.
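The constraint could be enforced with a simple path-prefix validation. The sketch below is hypothetical (the function name and error wording are not existing kubelet code); only the prefix check itself is the point.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// validateCgroupRoot rejects a configuration where the kubelet's cgroup root
// is nested under the system-reserved cgroup, per the constraint above.
// Hypothetical sketch, not existing kubelet code.
func validateCgroupRoot(cgroupRoot, systemReservedCgroup string) error {
	root := filepath.Clean(cgroupRoot)
	reserved := filepath.Clean(systemReservedCgroup)
	if root == reserved || strings.HasPrefix(root, reserved+"/") {
		return fmt.Errorf("cgroupRoot %q must not be a child of systemReservedCgroup %q",
			cgroupRoot, systemReservedCgroup)
	}
	return nil
}

func main() {
	// A nested root is rejected; disjoint roots pass.
	fmt.Println(validateCgroupRoot("/system.slice/kubepods", "/system.slice"))
}
```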
I would be happy for some clarification regarding the test plan for Beta 1.
This sounds a bit outdated. Can we change it to something like:
"Add e2e tests that verify that the right amount of swap limitation is automatically configured for burstable pods"?
@SergeyKanzhelev would you mind clarifying that?
Which performance test do we want to run here? IOW, what do we want to verify?
Also, what does "pods using a tmpfs" mean in this context?
Thanks!
@SergeyKanzhelev sorry, but I was also not able to gather any context around this line:
"Add e2e tests that verify swap performance with pods using a tmpfs."
I see that it was added when swap was introduced as alpha.
Let's limit swap to cgroup v2 only.
Added 👍
Not that I have a strong push-back against this, but I would still like to know: why not support both versions?
I understand that v2 is better in many ways, but should we really forcefully disallow swap usage on v1? Isn't a warning, or discouraging it, enough?
The reason to disallow it is to make sure there is a way for end users to protect themselves from swap on specific Pods. On v1, whatever we do, the app can still get swapped; so whatever end users do with their Pod, there is no guarantee it won't be swapped. On v2 we can provide some guarantees. Keeping in mind that v2 is something we want to encourage anyway, it's better to keep things safer out of the box.
It seems that I need to update kubernetes/kubernetes#105271 to support cgroup v2 only.
For cgroup v1, I think we should just log a warning and skip the logic.
For cgroup v2, memory.swap.high may be helpful at the node level. We can set it at the node level using a factor, like the other KEP https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2570-memory-qos does for memory.
Yeah, makes sense to align with v2 QoS work.
I have added a section on using memory.swap.high: https://github.com/kubernetes/enhancements/pull/3957/files#diff-32d9cbd2b6817965ecee751dfe43b0d91269aee738a5799c9c9a68d04a329ec6R194
I don't think that using swap throttling is the right approach here.

There are significant differences between memory.high and memory.swap.high, and they are used in a completely different manner. memory.high is described as "the main mechanism to control memory usage": "when hit, it throttles allocations by forcing them into direct reclaim to work off the excess" [1].

memory.swap.high, however, is described quite differently [1]. In addition, the Linux kernel commit message that brought in memory.swap.high [2] makes clear that the knob is not equivalent to memory.high, and that "Slowing misbehaving tasks down gradually allows user space oom killers or other protection mechanisms to react". Since we don't implement any user-space OOM killers, IIUC reaching the memory.swap.high limit would freeze the processes in the cgroup forever.

This makes sense, as swap memory isn't reclaimed as often as RAM, and it's much more expensive to do so. If processes in a cgroup reached the memory.swap.max boundary, however, they would simply have to keep allocated memory in RAM. If memory.high is defined, as the KEP suggests, the processes would reach a point where they freeze up, forced to reclaim memory to continue their operation.

The bottom line is that processes that are out of memory should reclaim RAM, not swap.
[1] https://www.kernel.org/doc/Documentation/admin-guide/cgroup-v2.rst
[2] https://lore.kernel.org/linux-mm/20200602044952.g4tpmtOiL%[email protected]/
Slowing misbehaving tasks down gradually allows user space oom killers or other protection mechanisms to react
IMO this is a good thing. We have seen so many issues in the past where the workload ends up consuming memory so fast that the kernel doesn't get adequate time to react. In this scenario, the entire system just freezes and the node eventually transitions into the NOT_READY state in k8s.
The two lines I'm most concerned about, and don't fully understand, are:
What does "point of no return" actually mean here? Does it mean that a user-space procedure needs to resume the cgroup's execution, or else it won't ever be able to allocate memory?
What are "regular operations"?
It worries me that this description is very different from the description of memory.high, which is said to be "the main mechanism to control memory usage".