KEP: Make kubelet's CPU manager respect Linux kernel isolcpus setting #2435
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: If they are not already assigned, you can assign the PR to them by writing The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA. It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/cc @ConnorDoyle
/ok-to-test
Force-pushed from 02bdcfe to f787a3b.
For example:
- outsourcing the management of a subset of specialized, or optimized CPUs to an external CPU manager without any (other) change in Kubelet's CPU manager
- ensuring proper resource accounting and separation within a hybrid infrastructure (e.g. OpenStack + Kubernetes running on the same node)
Some more color: Current CPU Manager tuning only affects things that the kubelet manages. There are threads not managed by the kubelet that affect performance of things running under the kubelet. To handle that, the kernel offers isolcpus or systemd configurations, features which are leveraged by a large ecosystem of software. We need the kubelet to coordinate with the kernel in order to offer applications fully de-jittered cores on which to be scheduled by the kubelet.
I fully agree, so I created this KEP :)
The aim of this KEP is to align kubelet with the isolcpus kernel parameter. It could be looked at as the first step on a longer journey, and right now I think this small step is a good place to start.
Was your comment just an elaboration on the subject, aimed at the reviewing community, or would you like to see some change in this KEP in order to get it accepted?
From my POV I would like to restrict its scope to this specific kernel parameter, but of course I'm curious about what you have in mind!
So, just to be on the same page, please take into account the Non-goals of this KEP:
"It is also outside the scope of this KEP to enhance the CPU manager itself with more fine-grained management policies, or introduce topology awareness into the CPU manager."
What I would like to achieve here is to leave CPU manager policies intact and unchanged, but make it possible to explicitly restrict the cores these policies can manage.
I definitely don't want to enhance kubelet's native manager so it can offer applications fully de-jittered cores (at least in the scope of this KEP). That is because multiple KEPs have already been created trying to achieve exactly that (referring to the CPU pooling proposals). And let's be honest, these were quasi-rejected by the SIG-Node and WG-ResourceManagement communities.
So, instead, what I would like to achieve here is to leave room for other (non K8s core) managers to do this pooling, while the community discusses how to support the same natively.
Even when that native support arrives, I feel that this enhancement would still be valid and useful, e.g. for the hybrid infrastructure use-cases.
I wanted to make my POV here a bit clearer (representing a set of customers, of course), and provide some historical context. Nothing more.
"Isolcpus" is a boot-time Linux kernel parameter, which can be used to isolate CPU cores from the generic Linux scheduler. | ||
This kernel setting is routinely used within the Linux community to manually isolate, and then assign CPUs to specialized workloads. | ||
The CPU Manager implemented within kubelet currently ignores this kernel setting when creating cpusets for Pods. | ||
This KEP proposes that CPU Manager should respects this kernel setting when assigning Pods to cpusets, through whichever supported CPU management policy. |
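(For concreteness, and not from the KEP itself: a minimal Go sketch of how the isolated set could be read back from the kernel. It assumes the standard Linux sysfs file `/sys/devices/system/cpu/isolated`, which reflects the `isolcpus=` boot parameter; the function names are made up for this illustration.)

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// readIsolatedCPUs returns the kernel's view of the isolated CPUs.
// On Linux, /sys/devices/system/cpu/isolated reflects the isolcpus=
// boot parameter (the file is empty when isolcpus is not set).
func readIsolatedCPUs() ([]int, error) {
	data, err := os.ReadFile("/sys/devices/system/cpu/isolated")
	if err != nil {
		return nil, err
	}
	return parseCPUList(strings.TrimSpace(string(data)))
}

// parseCPUList parses the kernel cpulist format, e.g. "2-5,8,10-11".
func parseCPUList(s string) ([]int, error) {
	var cpus []int
	if s == "" {
		return cpus, nil
	}
	for _, part := range strings.Split(s, ",") {
		if lo, hi, isRange := strings.Cut(part, "-"); isRange {
			start, err := strconv.Atoi(lo)
			if err != nil {
				return nil, err
			}
			end, err := strconv.Atoi(hi)
			if err != nil {
				return nil, err
			}
			for cpu := start; cpu <= end; cpu++ {
				cpus = append(cpus, cpu)
			}
		} else {
			cpu, err := strconv.Atoi(part)
			if err != nil {
				return nil, err
			}
			cpus = append(cpus, cpu)
		}
	}
	return cpus, nil
}

func main() {
	isolated, err := readIsolatedCPUs()
	if err != nil {
		fmt.Fprintln(os.Stderr, "reading isolated CPUs:", err)
		os.Exit(1)
	}
	fmt.Println("isolated CPUs:", isolated)
}
```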
Not sure if I understood "through whichever supported CPU management policy" correctly. Can you please reword this?
sure, sometimes I get carried away when writing :)
The meaning I wanted to convey is "respecting isolcpus should not be tied to any specific CPU management policy: either all policies should respect it, or none of them".
I will elaborate on this point more when I expand the KEP, but the idea is that if this setting were policy specific, for every policy we would need to introduce an alternative version which respects isolcpus. So, even today we would need to introduce a "default_with_isolcpus" and a "static_with_isolcpus" policy, and it would only get worse in the future.
Instead, what I had in mind is making this configurable node-wide via a kubelet flag / feature gate; when it is turned on, it is applied regardless of which CPU management policy is configured on the node.
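(Again only an illustrative sketch, not proposed kubelet code: the policy-agnostic idea boils down to subtracting the isolated CPUs from the pool before any policy sees it. The names `online`, `isolated` and `managedCPUs` are invented for this example.)

```go
package main

import "fmt"

// managedCPUs returns the CPUs kubelet's CPU manager would be allowed to
// manage: every online CPU that the kernel did not isolate via isolcpus=.
// The same subtraction would happen before any CPU manager policy
// (none, static, ...) runs, which is what keeps the behaviour policy-agnostic.
func managedCPUs(online, isolated []int) []int {
	isolatedSet := make(map[int]bool, len(isolated))
	for _, cpu := range isolated {
		isolatedSet[cpu] = true
	}
	var managed []int
	for _, cpu := range online {
		if !isolatedSet[cpu] {
			managed = append(managed, cpu)
		}
	}
	return managed
}

func main() {
	online := []int{0, 1, 2, 3, 4, 5, 6, 7}
	isolated := []int{2, 3, 6, 7} // e.g. the node was booted with isolcpus=2,3,6,7
	fmt.Println(managedCPUs(online, isolated)) // prints [0 1 4 5]
}
```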
## Motivation

The CPU Manager always assumes that it is the alpha and omega on a node, when it comes to managing the CPU resources of the host.
What is omega?
"alpha and omega" is just a phrase. I wanted to again just convey that kubelet assumes it is both the beginning and the end of CPU management on a node, and no other SW is running next to it which might also dabble into resource management.
can reword if you wish
The CPU Manager always assumes that it is the alpha and omega on a node, when it comes to managing the CPU resources of the host.
However, in certain infrastructures this might not always be the case.
While it is already possible to effectively take away CPU cores from the CPU manager via the kube-reserved and system-reserved kubelet flags, this implicit way of expressing isolation needs is not dynamic enough to cover all use-cases.
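(Side note for context, not part of the quoted diff: with the existing flags the reservation is expressed purely as a CPU quantity, e.g. `--kube-reserved=cpu=1000m` or `--system-reserved=cpu=1000m`, so an operator can state how much CPU to withhold from Kubernetes workloads, but not which physical cores, which is what makes the mechanism implicit.)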
We may use kubelet flags to take away a few cores, but even in that implicit way there is no guarantee that these cores are really "isolated".
sure, you are right
I will add another sentence explaining this
@ConnorDoyle @dchen1107 @derekwaynecarr
Force-pushed from 1dea757 to ae1cc65.
As the "early KEP merge" described by the process did not happen, I filled out the whole KEP. Kindly review! |
Kubelet's in-built CPU Manager always assumes that it is the primary software component managing the CPU cores of the host.
However, in certain infrastructures this might not always be the case.
While it is already possible to effectively take away CPU cores from the Kubernetes-managed workloads via the kube-reserved and system-reserved kubelet flags, this implicit way of declaring a Kubernetes-managed CPU pool is not flexible enough to cover all use-cases.
What use cases aren't covered?
All of the use-cases where any non-Kubernetes-managed process runs on a node which contains a kubelet.
Some of them are also mentioned in this document as thought-raisers.
Maybe a low-impact change would be to let `system-reserved` take in a cpuset. For example, the syntax `--system-reserved=cpu=cpuset:0-3` would mean that the system-reserved CPU allocation would be calculated to be `4000m`, and the CPUs which the CPU manager would not give out (excluded from both default and explicitly assigned sets) would be 0-3.
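(To illustrate the suggested syntax, which does not exist in kubelet today: a hypothetical Go sketch of how such a value could be interpreted. The function name and the restriction to a single range are my own simplifications.)

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseReservedCPUSet interprets the suggested (hypothetical) value format
// "cpuset:<first>-<last>" and returns the explicit CPU IDs together with
// the millicore quantity they would account for (one full core = 1000m).
// A real implementation would accept the full kernel cpulist grammar.
func parseReservedCPUSet(value string) ([]int, int64, error) {
	list, ok := strings.CutPrefix(value, "cpuset:")
	if !ok {
		return nil, 0, fmt.Errorf("expected a cpuset: prefix, got %q", value)
	}
	lo, hi, isRange := strings.Cut(list, "-")
	if !isRange {
		return nil, 0, fmt.Errorf("expected a range like 0-3, got %q", list)
	}
	start, err := strconv.Atoi(lo)
	if err != nil {
		return nil, 0, err
	}
	end, err := strconv.Atoi(hi)
	if err != nil {
		return nil, 0, err
	}
	var cpus []int
	for cpu := start; cpu <= end; cpu++ {
		cpus = append(cpus, cpu)
	}
	return cpus, int64(len(cpus)) * 1000, nil
}

func main() {
	// e.g. --system-reserved=cpu=cpuset:0-3 (suggested syntax, not an existing flag)
	cpus, millis, err := parseReservedCPUSet("cpuset:0-3")
	if err != nil {
		panic(err)
	}
	fmt.Printf("reserved CPUs %v => %dm\n", cpus, millis) // reserved CPUs [0 1 2 3] => 4000m
}
```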
Oh yeah, I actually had something like that in mind (but via a newly introduced flag), I just forgot to put it into the alternatives section!
In the end I went with the "isolcpus" way of declaration as it does not require additional manual adjustment from the operator, so K8s can "do the right thing by default", as Vish mentioned below.
But I will definitely add this to the alternatives section in my next commit! I'm fine either way; I guess it comes down to what the community is more comfortable with.
## Motivation

Kubelet's in-built CPU Manager always assumes that it is the primary software component managing the CPU cores of the host.
However, in certain infrastructures this might not always be the case.
k8s makes an assumption that it owns the node not just from the perspective of cpu isolation. Trying to change that fundamental assumption would be hard.
Hence we should not do it all at once, but step-by-step!
This would be the first step. It makes sense to start with the CPU because we already have an existing kernel-level parameter which we only need to respect to achieve it. I agree that doing the same for example with memory or hugepages would be more difficult, but those are outside the scope of this KEP, as described in the Non-goals section.
Therefore, the need arises to enhance the existing CPU manager with a method of explicitly defining a discontinuous pool of CPUs it can manage.
Making kubelet respect the isolcpus kernel setting fulfills exactly that need, while also doing it in a de-facto standard way.

If Kubernetes' CPU manager supported this more granular node configuration, infrastructure administrators could make multiple "CPU managers" seamlessly inter-work on the same node.
Sounds too complicated. Why do admins have to manage multiple CPU managers? Is there room for k8s to do the right thing by default?
My personal opinion is that the whole aim of this KEP is exactly what you describe: doing the right thing, by default when it comes to CPU.
Right now, if literally anything runs on a node next to kubelet, admins are forced either to evacuate their nodes or to configure their components accordingly. But the thing is that Kubernetes cannot even be configured right now not to overlap with other resource consumers.
However, isolcpus generally means exactly this: "don't touch these resources, I'm gonna use them for something". If kubelet also respected this wish of operators, isn't that exactly K8s doing the right thing by default?
IMHO it is, hence the KEP proposes simply respecting this flag, instead of requiring cluster admins to manually provide a list of cores K8s can manage.
If Kubernetes' CPU manager supported this more granular node configuration, infrastructure administrators could make multiple "CPU managers" seamlessly inter-work on the same node.
Such a feature could come in handy if one would like to:
- outsource the management of a subset of specialized, or optimized cores (e.g. real-time enabled CPUs, CPUs with different HT configurations, etc.) to an external CPU manager without any (other) change in Kubelet's CPU manager
Until we can automate and refine the existing CPU management policies, I'd like to avoid opening up extensions. We risk fragmenting the project quite a bit if we open up extensions prior to having a default solution that mostly just works.
I would generally accept and respect this comment under other circumstances, but the thing is that actually 2 such KEPs were discussed and quasi-rejected recently by the community.
The first CPU pooling KEP was trying to add a new de-facto CPU management policy to the CPU manager to achieve more fine-grained CPU management.
The re-worked CPU pooling KEP was trying to make the CPU manager externally extendable.
Both of them got stopped in their tracks, in part because of your objections.
I'm not saying this maliciously, because I totally get the reasoning of the community leaders, and to a certain degree I also agree with it :)
This is exactly why this KEP was born: this KEP does not open up the CPU manager in any new way, nor does it fragment the Kubernetes project.
It only wants to make the CPU manager respect boundaries which it should respect by default anyway, IMHO.
Please do consider that:
- the need is real. Infra operators cannot wait until the community decides which direction it wants to go with CPU pooling, and recent examples show that this is still a long time away
- Device Management is already an open interface, and it won't be closed. There are already external CPU managers out there (CMK, for instance, to mention just the most popular). You could say the "damage" is already done
We might as well recognize it, make the most of the situation (at least not double-bookkeep these resources), and take it as motivation to finally come up with a plan for how to make the Kubernetes CPU manager so awesome that nobody would ever consider employing an external manager in their cluster.
If Kubernetes' CPU manager supported this more granular node configuration, infrastructure administrators could make multiple "CPU managers" seamlessly inter-work on the same node.
Such a feature could come in handy if one would like to:
- outsource the management of a subset of specialized, or optimized cores (e.g. real-time enabled CPUs, CPUs with different HT configurations, etc.) to an external CPU manager without any (other) change in Kubelet's CPU manager
- ensure proper resource accounting and separation within a hybrid infrastructure (e.g. OpenStack + Kubernetes running on the same node)
Why should k8s support such an architecture?
Because it is a customer need, I guess :)
In any case, I only wanted to put some production use-cases on display to show that this simple feature would be useful even if Kubernetes had the best CPU manager in the world.
But for the sake of argument, let's pretend these production use-cases are not real and simply look at the most generic situation: people have "something" on their nodes next to kubelet, not managed by kubelet.
It can be a systemd service, a legacy application, really anything. Kubelet already having a system-reserved flag reinforces the idea that the resource management community has already recognized this use-case!
However, I think it is somewhat naive to assume that all of these non-kubelet-managed processes are running on the first couple of cores, all the time. Legacy applications are a handful; who knows how they were written back in the day, right?
I could say that even mighty Google faces this situation with Borg + Kubernetes co-running in their environment, right? Okay, maybe Google can afford to physically separate these systems from each other, but other companies, projects, customers etc. might not be this lucky, or resourceful :)
I think I will add this generic use-case to the next version as a third user-story, now that you made me think about it.
+1 for this feature or the alternative of allowing a cpuset to be specified in system-reserved.
Adding the draft version of the KEP, only containing the Overview section for early merge (as dictated by the KEP process). 07.31: Updated Reviewers metadata field to re-trigger the CNCF CLA check (which should be okay now)
Rephrasing a couple of sentences in the wake of the received comments.
- added a new use-case concentrating more on the every-day usage of this improvement
- added a new alternative to the proposed implementation method, changing the syntax of the system-reserved flag
- defined the inter-working between this proposal and existing capacity-manipulating features
Comments incorporated. KEP expanded with a new use-case, implementation alternative, and an inter-working scenario.
@Levovar: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/kind kep
Bump.
REMINDER: KEPs are moving to k/enhancements on November 30. Please attempt to merge this KEP before then to signal consensus. Any questions regarding this move should be directed to that thread and not asked on GitHub.
KEPs have moved to k/enhancements. Any questions regarding this move should be directed to that thread and not asked on GitHub.
@justaugustus: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Opening a KEP about the subject was previously aligned with @ConnorDoyle