KEP: Make kubelet's CPU manager respect Linux kernel isolcpus setting #2435

Closed
wants to merge 5 commits

Conversation


@Levovar Levovar commented Jul 30, 2018

Opening a KEP on this subject was previously agreed with @ConnorDoyle.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jul 30, 2018
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jul 30, 2018
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: calebamiles

If they are not already assigned, you can assign the PR to them by writing /assign @calebamiles in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


  • If you've already signed a CLA, it's possible we don't have your GitHub username or you're using a different email address. Check your existing CLA data and verify that your email is set on your git commits.
  • If you signed the CLA as a corporation, please sign in with your organization's credentials at https://identity.linuxfoundation.org/projects/cncf to be authorized.
  • If you have done the above and are still having issues with the CLA being reported as unsigned, please email the CNCF helpdesk: [email protected]

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/node Categorizes an issue or PR as relevant to SIG Node. labels Jul 30, 2018
@Levovar
Author

Levovar commented Jul 30, 2018

/cc @ConnorDoyle

@k8s-ci-robot k8s-ci-robot requested a review from ConnorDoyle July 30, 2018 16:53
@jeremyeder
Contributor

@vikaschoudhary16 @zvonkok

@ConnorDoyle
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jul 30, 2018
@k8s-ci-robot k8s-ci-robot requested a review from jeremyeder July 30, 2018 17:08
@Levovar Levovar force-pushed the isolcpus_kep branch 2 times, most recently from 02bdcfe to f787a3b Compare July 31, 2018 15:15
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jul 31, 2018
For example:
- outsourcing the management of a subset of specialized, or optimized CPUs to an external CPU manager without any (other) change in Kubelet's CPU manager
- ensuring proper resource accounting and separation within a hybrid infrastructure (e.g. Openstack + Kubernetes running on the same node)

Contributor

@jeremyeder jeremyeder Jul 31, 2018

Some more color: Current CPU Manager tuning only affects things that the kubelet manages. There are threads not managed by the kubelet that affect performance of things running under the kubelet. To handle that, the kernel offers isolcpus or systemd configurations, features which are leveraged by a large ecosystem of software. We need the kubelet to coordinate with the kernel in order to offer applications fully de-jittered cores on which to be scheduled by the kubelet.
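For readers less familiar with these mechanisms, the node tuning being referred to typically looks something like the following; the core numbers are purely illustrative:

# Kernel boot parameter (e.g. appended to the kernel command line in the GRUB config):
# take cores 4-7 away from the general scheduler
isolcpus=4-7

# systemd counterpart, in /etc/systemd/system.conf:
# keep the service manager, and everything it spawns, on the remaining cores
CPUAffinity=0 1 2 3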

Author

I fully agree, so I created this KEP :)

The aim of this KEP is to align kubelet with the isolcpus kernel parameter. It could be seen as the first step of a longer journey, but I think this small step is a good place to start.

Was your comment just an elaboration on the subject, aimed at the reviewing community, or would you like to see some change in this KEP in order to get it accepted?
From my POV I would like to restrict its scope to this specific kernel parameter, but of course I'm curious what you have in mind!

Author

So, just to be on the same page, please take into account the Non-goals of this KEP:
"It is also outside the scope of this KEP to enhance the CPU manager itself with more fine-grained management policies, or introduce topology awareness into the CPU manager."

What I would like to achieve here is to leave CPU manager policies intact and unchanged, but make it possible to explicitly restrict the cores these policies can manage.

I definitely don't want to enhance kubelet's native manager so it can offer applications fully de-jittered cores (at least not in the scope of this KEP). That is because multiple KEPs have already been created trying to achieve exactly that (referring to the CPU pooling proposals). And let's be honest, those were quasi-rejected by the SIG-Node and WG-ResourceManagement communities.

So, instead, what I would like to achieve here is to leave room for other (non K8s core) managers to do this pooling, while the community discusses how to support the same natively.

Even when that native support arrives, I feel this enhancement would still be valid and useful, e.g. for the hybrid infrastructure use-cases.

Contributor

I wanted to make my POV here a bit clearer (representing a set of customers, of course) and provide some historical context. Nothing more.

"Isolcpus" is a boot-time Linux kernel parameter, which can be used to isolate CPU cores from the generic Linux scheduler.
This kernel setting is routinely used within the Linux community to manually isolate, and then assign CPUs to specialized workloads.
The CPU Manager implemented within kubelet currently ignores this kernel setting when creating cpusets for Pods.
This KEP proposes that the CPU Manager should respect this kernel setting when assigning Pods to cpusets, through whichever supported CPU management policy.
Contributor

Not sure if I understood "through whichever supported CPU management policy" correctly. Can you please reword this?

Author

sure, sometimes I get carried away when writing :)

The meaning I wanted to convey is "respecting isolcpus should not be tied to any specific CPU management policy; either all policies should respect it, or none of them".
I will elaborate on this point more when I expand the KEP, but the idea is that if this setting were policy specific, we would need to introduce an alternative version of every policy that respects isolcpus. So, even today we would need to introduce a "default_with_isolcpus" and a "static_with_isolcpus" policy, and it would only get worse in the future.

Instead, what I had in mind is making this configurable node-wide via a kubelet flag / feature gate; when it is turned on, it is applied regardless of which CPU management policy is configured on the node.
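Roughly, the shape of that idea, as a sketch only (the names below are made up for illustration and are not from the KEP): the isolcpus filter would sit in front of every policy, so the policies themselves stay untouched.

// Illustrative only: compute the set of CPUs the CPU manager is allowed to
// hand out, regardless of which policy is configured. Isolated cores are
// simply never part of that set while the feature is enabled.
func assignableCPUs(onlineCPUs, kernelIsolatedCPUs map[int]struct{}, isolcpusFeatureEnabled bool) map[int]struct{} {
	assignable := map[int]struct{}{}
	for cpu := range onlineCPUs {
		if isolcpusFeatureEnabled {
			if _, isolated := kernelIsolatedCPUs[cpu]; isolated {
				continue // leave isolated cores to whatever the operator isolated them for
			}
		}
		assignable[cpu] = struct{}{}
	}
	return assignable
}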


## Motivation

The CPU Manager always assumes that it is the alpha and omega on a node, when it comes to managing the CPU resources of the host.
Contributor

What is omega?

Author

"alpha and omega" is just a phrase. I wanted to again just convey that kubelet assumes it is both the beginning and the end of CPU management on a node, and no other SW is running next to it which might also dabble into resource management.

can reword if you wish


The CPU Manager always assumes that it is the alpha and omega on a node, when it comes to managing the CPU resources of the host.
However, in certain infrastructures this might not always be the case.
While it is already possible to effectively take away CPU cores from the CPU manager via the kube-reserved and system-reserved kubelet flags, this implicit way of expressing isolation needs is not dynamic enough to cover all use-cases.
Contributor

We may use kubelet flags to take away a few cores, but even in that implicit way there is no guarantee that these cores are really "isolated".

Author

sure, you are right
I will add another sentence explaining this

@Levovar
Author

Levovar commented Aug 3, 2018

@ConnorDoyle @dchen1107 @derekwaynecarr
Could you take a look at the base idea presented in this review, and share your comments, if any?
If you agree with the generic direction I would be happy to flesh out the proposal in a follow-up PR, but it would be good to reserve (or free in case of a rejection) the KEP number going forward.

@Levovar Levovar force-pushed the isolcpus_kep branch 2 times, most recently from 1dea757 to ae1cc65 Compare August 7, 2018 16:37
@Levovar Levovar changed the title "Early" KEP for making CPU manager respect Linux kernel isolcpus setting Full KEP for making kubelet's CPU manager respect Linux kernel isolcpus setting Aug 7, 2018
@Levovar
Author

Levovar commented Aug 7, 2018

As the "early KEP merge" described by the process did not happen, I filled out the whole KEP.

Kindly review!

@Levovar Levovar changed the title Full KEP for making kubelet's CPU manager respect Linux kernel isolcpus setting KEP: Make kubelet's CPU manager respect Linux kernel isolcpus setting Aug 8, 2018

Kubelet's in-built CPU Manager always assumes that it is the primary software component managing the CPU cores of the host.
However, in certain infrastructures this might not always be the case.
While it is already possible to effectively take away CPU cores from the Kubernetes-managed workloads via the kube-reserved and system-reserved kubelet flags, this implicit way of declaring a Kubernetes-managed CPU pool is not flexible enough to cover all use-cases.
Contributor

What use cases aren't covered?

Author

All of the use-cases where non-Kubernetes-managed processes run on a node that also runs a kubelet.
Some of them are also mentioned in this document as thought-raisers.


Maybe a low-impact change would be to let system-reserved take in a cpuset. For example, the syntax

--system-reserved=cpu=cpuset:0-3

would mean that the system-reserved cpu allocation would be calculated to be 4000m, and the CPUs which the CPU manager would not give out (excluded from both the default and explicitly assigned sets) would be 0-3.
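To make the suggested semantics concrete, a hypothetical sketch: the helper below is not an existing kubelet function, and only the simple "start-end" form of the value is handled.

import (
	"fmt"
	"strconv"
	"strings"
)

// parseSystemReservedCPU is a hypothetical helper illustrating the suggested
// --system-reserved=cpu=cpuset:0-3 syntax: the reserved quantity is derived
// from the number of listed CPUs (4 CPUs -> 4000m), and the listed ids would
// be withheld from both the shared pool and exclusive assignments.
func parseSystemReservedCPU(value string) (millicores int, cpus []int, err error) {
	if !strings.HasPrefix(value, "cpuset:") {
		// today's behaviour: a plain quantity such as "2000m" reserves
		// capacity but says nothing about which cores (not handled here)
		return 0, nil, fmt.Errorf("not a cpuset value: %q", value)
	}
	bounds := strings.SplitN(strings.TrimPrefix(value, "cpuset:"), "-", 2)
	if len(bounds) != 2 {
		return 0, nil, fmt.Errorf("expected a range like 0-3, got %q", value)
	}
	start, err := strconv.Atoi(bounds[0])
	if err != nil {
		return 0, nil, err
	}
	end, err := strconv.Atoi(bounds[1])
	if err != nil {
		return 0, nil, err
	}
	for i := start; i <= end; i++ {
		cpus = append(cpus, i)
	}
	return len(cpus) * 1000, cpus, nil
}

With this interpretation, parseSystemReservedCPU("cpuset:0-3") would return 4000 millicores and the ids 0, 1, 2, 3.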

Author

Oh yeah, I actually had something like that in mind (but via a newly introduced flag), I just forgot to put it into the alternatives section!

I went with the "isolcpus" way of declaration in the end, as it does not require additional manual adjustment from the operator, so K8s can "do the right thing by default", as Vish mentioned below.

But I will definitely add this to the alternatives section in my next commit! I'm fine either way; I guess it comes down to what the community is more comfortable with.

## Motivation

Kubelet's in-built CPU Manager always assumes that it is the primary software component managing the CPU cores of the host.
However, in certain infrastructures this might not always be the case.
Contributor

k8s makes an assumption that it owns the node not just from the perspective of cpu isolation. Trying to change that fundamental assumption would be hard.

Author

@Levovar Levovar Aug 9, 2018

Hence we should not do it all at once, but step-by-step!
This would be the first step. It makes sense to start with CPU because we already have an existing kernel-level parameter which we only need to respect in order to achieve it. I agree that doing the same for, say, memory or hugepages would be more difficult, but those are outside the scope of this KEP, as described in the Non-goals section.

Therefore, the need arises to enhance the existing CPU manager with a method of explicitly defining a discontinuous pool of CPUs it can manage.
Making kubelet respect the isolcpus kernel setting fulfills exactly that need, while also doing it in a de-facto standard way.
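On the mechanics of "respecting" the setting, a minimal sketch of how the isolated set could be discovered is shown below. It assumes a reasonably recent kernel that exports the list at /sys/devices/system/cpu/isolated; the file path and cpulist format are real, but the function names are illustrative and not part of the KEP.

import (
	"os"
	"strconv"
	"strings"
)

// isolatedCPUs returns the set of CPU ids the kernel keeps away from the
// general scheduler (the isolcpus= boot parameter). Recent kernels export the
// list at /sys/devices/system/cpu/isolated as a cpulist string such as
// "4-7,12"; an empty file means nothing is isolated.
func isolatedCPUs() (map[int]struct{}, error) {
	data, err := os.ReadFile("/sys/devices/system/cpu/isolated")
	if err != nil {
		return nil, err
	}
	return parseCPUList(strings.TrimSpace(string(data)))
}

// parseCPUList parses a kernel cpulist ("0,2,4-7") into a set of CPU ids.
func parseCPUList(list string) (map[int]struct{}, error) {
	cpus := map[int]struct{}{}
	if list == "" {
		return cpus, nil
	}
	for _, part := range strings.Split(list, ",") {
		lo, hi, isRange := strings.Cut(part, "-")
		if !isRange {
			hi = lo // single CPU id, e.g. "12"
		}
		start, err := strconv.Atoi(lo)
		if err != nil {
			return nil, err
		}
		end, err := strconv.Atoi(hi)
		if err != nil {
			return nil, err
		}
		for id := start; id <= end; id++ {
			cpus[id] = struct{}{}
		}
	}
	return cpus, nil
}

The resulting set would then simply be subtracted from the CPUs the manager considers assignable, independent of the configured policy, as in the earlier sketch.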

If Kubernetes' CPU manager supported this more granular node configuration, then infrastructure administrators could make multiple "CPU managers" interwork seamlessly on the same node.
Contributor

Sounds too complicated. Why do admins have to manage multiple CPU managers? Is there room for k8s to do the right thing by default?

Author

My personal opinion is that the whole aim of this KEP is exactly what you describe: doing the right thing by default when it comes to CPU.
Right now, if literally anything runs on a node next to kubelet, admins are forced either to evacuate their nodes or to configure their components accordingly. But the thing is that Kubernetes cannot even be configured right now so as not to overlap with other resource consumers.

However, isolcpus generally means exactly this: "don't touch these resources, I'm going to use them for something". If kubelet also respected this wish of operators, isn't that exactly K8s doing the right thing by default?

IMHO it is, hence the KEP proposes simply respecting this flag, instead of requiring cluster admins to manually provide a list of cores K8s can manage.


If Kubernetes' CPU manager supported this more granular node configuration, then infrastructure administrators could make multiple "CPU managers" interwork seamlessly on the same node.
Such a feature could come in handy if one would like to:
- outsource the management of a subset of specialized or optimized cores (e.g. real-time enabled CPUs, CPUs with a different HT configuration, etc.) to an external CPU manager without any (other) change in Kubelet's CPU manager
Contributor

Until we can automate and refine the existing CPU management policies, I'd like to avoid opening up extensions. We risk fragmenting the project quite a bit if we open up extensions prior to having a default solution that mostly just works.

Author

I would generally accept and respect this comment under other circumstances, but the thing is that two such KEPs were actually discussed, and quasi-rejected, by the community recently.
The first CPU pooling KEP tried to add a new de-facto CPU management policy to the CPU manager to achieve more fine-grained CPU management.
The re-worked CPU pooling KEP tried to make the CPU manager externally extendable.

Both of them got stopped in their tracks, in part because of your objections.
I'm not saying this maliciously, because I totally get the reasoning of the community leaders, and to a certain degree I also agree with it :)

This is exactly why this KEP was born: it does not open up the CPU manager in any new way, nor does it fragment the Kubernetes project.
It only wants to make the CPU manager respect boundaries which, IMHO, it should respect by default anyway.
Please, do consider that:

  • the need is real. Infra operators cannot wait until the community decides which direction it wants to go with CPU pooling, and recent examples show that this is still a long way off
  • Device Management is already an open interface, and it won't be closed. There are already external CPU managers out there (CMK, for instance, to mention only the most popular). You could say the "damage" is already done
    We might as well recognize it, make the most of the situation (at least not double-bookkeep these resources), and take it as motivation to finally come up with a plan for how to make the Kubernetes CPU manager so awesome that nobody would ever consider employing an external manager in their cluster

If Kubernetes' CPU manager supported this more granular node configuration, then infrastructure administrators could make multiple "CPU managers" interwork seamlessly on the same node.
Such a feature could come in handy if one would like to:
- outsource the management of a subset of specialized or optimized cores (e.g. real-time enabled CPUs, CPUs with a different HT configuration, etc.) to an external CPU manager without any (other) change in Kubelet's CPU manager
- ensure proper resource accounting and separation within a hybrid infrastructure (e.g. Openstack + Kubernetes running on the same node)
Contributor

Why should k8s support such an architecture?

Author

@Levovar Levovar Aug 9, 2018

Because it is a customer need, I guess :)

In any case, I only wanted to put some production use-cases on display to show that this simple feature would be useful even if Kubernetes had the best CPU manager in the world.

But, for the sake of the argument let's pretend these production use-cases are not real: let's just simply look at the most generic situation: people have "something" on their nodes next to kubelet, not managed by kubelet.

It can be a systemd service. It can be a legacy application. It can really be anything. Kubelet already having a system-reserved flag shows that the resource management community has recognized this use-case!
However, I think it is somewhat naive to assume that all of these non-kubelet-managed processes are running on the first couple of cores, all the time. Legacy applications are a handful; who knows how they were written back in the day, right?

I could say that even mighty Google faces this situation with Borg + Kubernetes co-running in their environment, right? Okay, maybe Google can afford to physically separate these systems from each other, but other companies, projects, customers, etc. might not be this lucky, or resourceful :)

I think I will add this generic use-case to the next version as a third user-story, now that you made me think about it

@lmdaly

lmdaly commented Aug 13, 2018

+1 for this feature or the alternative of allowing a cpuset to be specified in system-reserved.

Adding the draft version of the KEP, only containing the Overview section for early merge (as dictated by the KEP process)
07.31: Updated Reviewers metadata field to re-trigger CNCF CLA check (which should be okay now)
Rephrasing a couple of sentences in the wake of the received comments.
- added a new use-case concentrating more on the every-day usage of this improvement
- added a new alternative to the proposed implementation method, changing the syntax of the system-reserved flag
- defined the inter-working between this proposal, and existing capacity manipulating features
@Levovar
Author

Levovar commented Aug 15, 2018

Comments incorporated.

KEP expanded with a new use-case, implementation alternative, and an inter-working scenario.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 27, 2018
@k8s-ci-robot
Contributor

@Levovar: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@justaugustus
Member

/kind kep

@Levovar
Author

Levovar commented Oct 19, 2018

Bump.
Would appreciate some feedback from @vishh and @ConnorDoyle regarding still outstanding blocking issues (if any), or open questions.

@justaugustus
Member

REMINDER: KEPs are moving to k/enhancements on November 30. Please attempt to merge this KEP before then to signal consensus.
For more details on this change, review this thread.

Any questions regarding this move should be directed to that thread and not asked on GitHub.

@justaugustus
Member

KEPs have moved to k/enhancements.
This PR will be closed and any additional changes to this KEP should be submitted to k/enhancements.
For more details on this change, review this thread.

Any questions regarding this move should be directed to that thread and not asked on GitHub.
/close

@k8s-ci-robot
Contributor

@justaugustus: Closed this PR.

In response to this:

KEPs have moved to k/enhancements.
This PR will be closed and any additional changes to this KEP should be submitted to k/enhancements.
For more details on this change, review this thread.

Any questions regarding this move should be directed to that thread and not asked on GitHub.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
9 participants