
KEP-4176: A Static Policy Option to spread hyperthreads across physical CPUs #4177

Merged
merged 16 commits into from
Feb 9, 2024

Conversation

Contributor

@Jeffwan Jeffwan commented Sep 5, 2023

  • One-line PR description: This is a KEP PR to add a new static policy in cpu manager.
  • Other comments:

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. labels Sep 5, 2023
@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Sep 5, 2023
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Sep 5, 2023
Co-authored-by: Lingyan Yin <[email protected]>
Co-authored-by: Zewei Ding <[email protected]>
Co-authored-by: Shengjie Xue <[email protected]>
Contributor

@ffromani ffromani left a comment

initial review

Comment on lines 224 to 225
We propose to add a new `CPUManager` policy option called `spread-physical-cpus-preferred` to the static CPUManager policy. When enabled, this will trigger the CPUManager to try to spread CPUs across physical cores as much as possible.

Contributor

perhaps you already answered this later in the doc, but let me ask anyway:
ignoring the admittedly pretty ugly UX of this solution, if you need to run a pod with the aforementioned requirements and

  • assuming cpumanager static policy in effect
  • enable full-pcpus-only policy option
  • double the CPU requirement (e.g. the workload actually needs say 10, you ask 20)
  • assuming the pod would be QoS guaranteed of course

would this actually work?
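For illustration, here is what the workaround described above might look like as a Guaranteed-QoS pod built with the Kubernetes Go API types; the workload notionally needs 10 CPUs but requests 20 so that, with the static policy and full-pcpus-only, it is granted full physical cores with no shared siblings. All names and values are hypothetical.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Hypothetical pod: requests == limits (Guaranteed QoS) and the CPU
	// request is doubled from 10 to 20 to over-provision full physical cores.
	pod := &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "overprovisioned-workload"},
		Spec: v1.PodSpec{
			Containers: []v1.Container{{
				Name:  "app",
				Image: "example.com/app:latest",
				Resources: v1.ResourceRequirements{
					Requests: v1.ResourceList{
						v1.ResourceCPU:    resource.MustParse("20"),
						v1.ResourceMemory: resource.MustParse("8Gi"),
					},
					Limits: v1.ResourceList{
						v1.ResourceCPU:    resource.MustParse("20"),
						v1.ResourceMemory: resource.MustParse("8Gi"),
					},
				},
			}},
		},
	}
	fmt.Println(pod.Name)
}
```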

Contributor Author

@Jeffwan Jeffwan Sep 12, 2023

Do you mean making it work with in-place pod update? If so, it currently doesn't work. We made some changes downstream to make the CPU manager work with the in-place pod update feature. Please check https://docs.google.com/document/d/1V3DLh3pH3CD-xhhJvAnOq_oWgPyjO-vj6wY6qdew9H0/edit#heading=h.ybybfdfputt, paragraph 2, for more details.

This proposal concentrates more on the CPU allocation policy, since full-pcpus-only allocates whole physical CPUs, which is not what we want in our scenarios.

Contributor

I meant something simple and brutal: overspecifying the pod requirements in the pod spec before submitting to the cluster. I'm just trying to fully grasp the gaps in the existing options.

Contributor Author

@Jeffwan Jeffwan Sep 12, 2023

I see. It really depends on the node capacity. Let's say the node has 48 vCPUs. If you use full-pcpus-only, it will use 10 physical CPUs. If you use the newly proposed solution, it will use 20 physical CPUs (each physical CPU has only one vCPU in use). If the node only has 20 vCPUs, it makes no difference in the end and all vCPUs are occupied.

Contributor

The UX is admittedly bad, and this is bad for pod density. But since noisy neighbors were mentioned in the KEP, I think we should explore this (again, suboptimal) path, and most notably the interaction with other options, like full-pcpus-only.

Contributor

We need to explore and document how this option interacts with other existing options. Just to be clear, IMO making this option incompatible with others is a possibility, but we should explore the option space and try to make the feature composable; incompatibility should be the last resort. I'm inclined to say this is an alpha blocker, but we can talk about it. It will surely be a beta blocker.

Contributor

This comment ^^^ is still relevant; we need to spell out how this option composes with the others and what we do if incompatible options are enabled at the same time. Probably the kubelet should fail to start?
In general it will be better to make the options compose; this has pretty much been the expectation up until now.
If the options cannot compose, it is worth explaining why in the KEP text (perhaps in the design details section).
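As a minimal sketch of the fail-fast behaviour suggested above, assuming hypothetical option names and a standalone helper (the real kubelet option validation may differ):

```go
package cpumanager

import "fmt"

// Illustrative option names; the real constants live in the kubelet.
const (
	fullPCPUsOnly            = "full-pcpus-only"
	distributeCPUsAcrossNUMA = "distribute-cpus-across-numa"
	spreadPhysicalCPUs       = "spread-physical-cpus-preferred"
)

// validateOptionCompatibility sketches a startup check: if the new option is
// combined with an option it cannot compose with, return a configuration
// error so the kubelet can refuse to start.
func validateOptionCompatibility(opts map[string]bool) error {
	if opts[spreadPhysicalCPUs] && opts[fullPCPUsOnly] {
		return fmt.Errorf("policy options %q and %q cannot be enabled together", spreadPhysicalCPUs, fullPCPUsOnly)
	}
	if opts[spreadPhysicalCPUs] && opts[distributeCPUsAcrossNUMA] {
		return fmt.Errorf("policy options %q and %q cannot be enabled together", spreadPhysicalCPUs, distributeCPUsAcrossNUMA)
	}
	return nil
}
```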


The risk associated with implementing this new proposal is minimal. It pertains only to a distinct policy option within the `CPUManager`, guarded by the option's own opt-in mechanism in addition to the `CPUManagerPolicyAlphaOptions` feature gate, which is disabled by default.
Contributor

in case you add a new policy, however, you will need a new feature gate. Everything else still holds true

Contributor Author

Thanks for the reminder, I didn't explain it clearly. It's a new option instead of a new policy, so the existing feature gates are good enough.


We verify correctness by checking the kubelet log and the CPU allocation of the workload. We have not added any metrics for this new feature.
Contributor

we can probably discuss extending the existing cpumanager metrics

Contributor Author

That's a good idea. This is indeed something we want to monitor to give users more confidence.
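As a rough sketch of what such a metric could look like (the metric name and labels are hypothetical, and the kubelet would use its own metrics framework rather than the raw Prometheus client):

```go
package cpumanager

import "github.com/prometheus/client_golang/prometheus"

// cpuSpreadAllocationsTotal is a hypothetical counter for exclusive CPU
// allocations performed while spread-physical-cpus-preferred is enabled.
var cpuSpreadAllocationsTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "cpu_manager_spread_physical_cpus_allocations_total",
		Help: "Number of exclusive CPU allocations made with the spread-physical-cpus-preferred option enabled.",
	},
	[]string{"result"}, // e.g. "success" or "error"
)

func init() {
	prometheus.MustRegister(cpuSpreadAllocationsTotal)
}
```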

Contributor

I'm (slowly) working on a PR to expose events for observability, but it's better not to make this KEP depend on that. We can perhaps integrate later at beta graduation.

@Jeffwan Jeffwan changed the title KEP-4176: Static Policy to spread hyperthreads across physical CPUs KEP-4176: A Static Policy Option to spread hyperthreads across physical CPUs Sep 12, 2023
@swatisehgal
Contributor

/cc

@pacoxu
Member

pacoxu commented Sep 14, 2023

/cc @saschagrunert
/cc

@ffromani
Contributor

/cc @kad

I'll review again ASAP. From my PoV, the main outstanding item is to clarify how this proposed option interacts with existing options, in particular with full-pcpus-only. Likely other options are relevant here, will add details in my next review.

@k8s-ci-robot k8s-ci-robot requested a review from kad September 21, 2023 10:04
@ffromani
Contributor

[...] the main outstanding item is to clarify how this proposed option interacts with existing options, in particular with full-pcpus-only. Likely other options are relevant here, will add details in my next review.

@Jeffwan for awareness, I'm not sure whether you have already covered this in the KEP or plan to. I think the proposal needs to explain how this new option and

  • full-pcpus-only
  • distribute-cpus-across-numa

work together, and whether or not they are compatible with each other. Having incompatible options is a possibility, but the reason for the incompatibility needs to be clearly documented in the design proposal. Incompatibility could lead to bad UX and can be a design aspect to iterate over.

Contributor

@ffromani ffromani left a comment

revamping review as the KEP freeze deadline is approaching


@kad
Member

kad commented Oct 3, 2023

... to clarify how this proposed option interacts with existing options, in particular with full-pcpus-only. Likely other options are relevant here, will add details in my next review.

IMHO, we potentially have two possibilities if both options are specified:

  1. error as misconfiguration
  2. do something implicit: e.g., if the node is hyperthreaded and requests.cpu % num_hyperthreads == 0, then prefer full physical cores, else spread across physical cores.

I'd be more inclined to option 1. If the behaviour is not clearly predictable, it's better to mark it as a configuration problem and let the node owner select explicit node options (probably with explicit labels), so workloads will be able to prefer one combination of options over another.
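A minimal sketch of the implicit behaviour described in option 2, with purely illustrative names (not proposed kubelet code):

```go
package cpumanager

// chooseAllocationStrategy: on a hyperthreaded node, prefer full physical
// cores when the request is an exact multiple of the threads per core;
// otherwise spread the allocation across physical cores.
func chooseAllocationStrategy(requestedCPUs, threadsPerCore int, hyperthreaded bool) string {
	if !hyperthreaded {
		return "default"
	}
	if requestedCPUs%threadsPerCore == 0 {
		return "prefer-full-physical-cores" // behaves like full-pcpus-only
	}
	return "spread-across-physical-cores" // behaves like the proposed option
}
```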

@ffromani
Contributor

ffromani commented Oct 3, 2023

... to clarify how this proposed option interacts with existing options, in particular with full-pcpus-only. Likely other options are relevant here, will add details in my next review.

IMHO, we potentially have two possibilities if both options are specified:

1. error as misconfiguration

2. do something implicit: e.g., if the node is hyperthreaded and requests.cpu % num_hyperthreads == 0, then prefer full physical cores, else spread across physical cores.

I'd be more inclined to option 1. If the behaviour is not clearly predictable, it's better to mark it as a configuration problem and let the node owner select explicit node options (probably with explicit labels), so workloads will be able to prefer one combination of options over another.

That would work for me both in general and as a specific approach. Having some form of discussion in the KEP about how the options interact (or not) and why is, IMO, an alpha blocker; further refinement, if needed/desired, can be deferred.

@Jeffwan
Contributor Author

Jeffwan commented Oct 3, 2023

@ffromani Due to a recent release, I was quite swamped. However, I have more availability this month and will address the mentioned comments today

@Jeffwan
Contributor Author

Jeffwan commented Oct 3, 2023

/cc @LingyanYin

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 9, 2024
@Jeffwan
Contributor Author

Jeffwan commented Feb 9, 2024

@klueska I accepted the change and I think that makes it clearer.

Contributor

@klueska klueska left a comment

I went through the whole KEP and am inclined to approve it, assuming you make the suggested changes. The major change is making sure that the terminology used is consistent with the terms used by Kubernetes for cores/CPUs (even if different terms are typically used outside of k8s).

In general, the change is small, self-contained, and fairly well understood. I see very little risk with moving forward with this proposal.

The current default sorting order is `sockets`, `cores`, and then `cpus`. Using a machine with a 2-socket, 6-core, 12-CPU topology as an example, the default CPU ordering is [0, 6, 2, 8, 4, 10 | 1, 7, 3, 9, 5, 11]. In that case, if the CPU manager plans to allocate two CPUs, [0, 6] will be picked. However, they belong to the same socket, the same NUMA node, and the same physical core, which does not meet our case.


In order to meet our use case, we can change the sorting algorithm to sort by `socket` and then directly by `cpus`, without taking physical cores into the ordering. In that case, we get the CPU sequence [0, 2, 4, 6, 8, 10 | 1, 3, 5, 7, 9, 11]. From the topology information, we know [0, 2] will be allocated for a 2-CPU container, and 0 and 2 are from the same socket but different physical cores.
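A self-contained sketch of the reordering described above, using the example 2-socket, 6-core, 12-CPU topology; the types and sorting here are illustrative and not the kubelet's actual topology code:

```go
package main

import (
	"fmt"
	"sort"
)

// cpuInfo carries just the fields the example needs; in the kubelet the
// equivalent data comes from the discovered CPU topology.
type cpuInfo struct {
	id, core, socket int
}

func ids(cs []cpuInfo) []int {
	out := make([]int, len(cs))
	for i, c := range cs {
		out[i] = c.id
	}
	return out
}

func main() {
	// CPUs 0/6, 2/8, 4/10 are hyperthread siblings on socket 0;
	// CPUs 1/7, 3/9, 5/11 are siblings on socket 1.
	cpus := []cpuInfo{
		{0, 0, 0}, {6, 0, 0}, {2, 1, 0}, {8, 1, 0}, {4, 2, 0}, {10, 2, 0},
		{1, 3, 1}, {7, 3, 1}, {3, 4, 1}, {9, 4, 1}, {5, 5, 1}, {11, 5, 1},
	}

	// Default ordering: socket, then core, then CPU id.
	defaultOrder := append([]cpuInfo(nil), cpus...)
	sort.Slice(defaultOrder, func(i, j int) bool {
		a, b := defaultOrder[i], defaultOrder[j]
		if a.socket != b.socket {
			return a.socket < b.socket
		}
		if a.core != b.core {
			return a.core < b.core
		}
		return a.id < b.id
	})

	// Proposed ordering: socket, then CPU id, ignoring physical cores,
	// so hyperthread siblings are no longer adjacent.
	spreadOrder := append([]cpuInfo(nil), cpus...)
	sort.Slice(spreadOrder, func(i, j int) bool {
		a, b := spreadOrder[i], spreadOrder[j]
		if a.socket != b.socket {
			return a.socket < b.socket
		}
		return a.id < b.id
	})

	fmt.Println(ids(defaultOrder)) // [0 6 2 8 4 10 1 7 3 9 5 11]
	fmt.Println(ids(spreadOrder))  // [0 2 4 6 8 10 1 3 5 7 9 11]
}
```

Allocating the first two entries of each ordering reproduces the example: [0, 6] (same physical core) with the default order, [0, 2] (different physical cores) with the spread order.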
Contributor

@klueska klueska Feb 9, 2024

I think this is sufficient, but we should make sure it is true in all cases. In any case, we can iterate on this during implementation regardless.

Contributor Author

Got it. We will at least cover the scenarios we've seen so far to make sure it has good coverage.

The failure modes are similar to those of the existing options. This option changes the way the CPU manager allocates CPUs.
It is compatible when users switch between options; however, when a pod gets rescheduled, it will follow the currently configured static option instead of the previous one.

Currently, in the alpha version, we consider it incompatible with other options, and users should stick to this option. Compatibility issues will be resolved in a future version.
Contributor

We should try to make it as compatible as possible with other options in this release. I think composability of the various policy options is one of the nice things about the policy options framework. It is obviously incompatible with the distribute-cpus-across-numa option, but I think there is probably a logical way to make it work with the full-pcpus-only option (e.g. by reducing the overall number of available cores by half). This would obviously mean that only half the cores are available for allocation, but then you would be able to guarantee that you don't have any noisy neighbors, in addition to being able to take advantage of the L2 cache.
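A minimal sketch of the composition idea above, assuming an illustrative helper and a CPU-to-core map (not actual kubelet code): keep only one hyperthread per physical core in the allocatable set, halving capacity but guaranteeing that no sibling thread is ever handed to another container.

```go
package cpumanager

// onePerPhysicalCore returns at most one CPU per physical core. coreOf maps
// a CPU ID to its physical core ID.
func onePerPhysicalCore(cpus []int, coreOf map[int]int) []int {
	seen := map[int]bool{}
	var out []int
	for _, cpu := range cpus {
		core := coreOf[cpu]
		if seen[core] {
			continue // skip the sibling hyperthread
		}
		seen[core] = true
		out = append(out, cpu)
	}
	return out
}
```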

- "@LastNight1997"
owning-sig: sig-node
participating-sigs: []
status: provisional
Contributor

Is this right?

Contributor

no, it should be implementable

Contributor Author

Sounds good. I changed it to implementable

feature-gates:
  - name: "CPUManagerPolicyAlphaOptions"
    components:
      - kubelet
Contributor

Does the kube-apiserver validate the names of CPU manager policy options? If so, this gate is relevant there too.

Contributor

kubelet does this validation atm, I don't think this will change

Contributor


### Risks and Mitigations
Contributor

(How) will we drive this new policy from alpha to general availability?

Contributor

@klueska klueska Feb 9, 2024

It will move from being a CPUManagerPolicyAlphaOption, to a CPUManagerPolicyBetaOption, to eventually having no feature gate protecting it at all.

Here's an example of one such option moving to beta last release (albeit for a similar mechanism in the TopologyManager rather than the CPUManager):

@Jeffwan
Contributor Author

Jeffwan commented Feb 9, 2024

I accepted most of the changes proposed by @klueska on the consistency of the terms used throughout the k8s codebase and also updated the status to implementable. Please have another look @ffromani @klueska @sftim @mrunalp

@alculquicondor
Member

It looks like I was tagged on this at some point... are there any scheduler implications?

@klueska
Contributor

klueska commented Feb 9, 2024

@alculquicondor no. I'm not sure why you would have been tagged.

@klueska
Contributor

klueska commented Feb 9, 2024

Thanks @Jeffwan for your contribution. This new policy option is straightforward and self-contained and will be a nice addition to complement the policy options that already exist for the CPUManager.

Please file an exception request for this feature, as per the instructions here:
https://github.com/kubernetes/sig-release/blob/master/releases/EXCEPTIONS.md

Indicate that the KEP approval was delayed due to reviewer bandwidth, but has already been determined to be acceptable and is in the process of being approved now.

/lgtm
/approve

/assign @mrunalp

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 9, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Jeffwan, jpbetz, klueska, mrunalp

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 9, 2024
@k8s-ci-robot k8s-ci-robot merged commit 634faf2 into kubernetes:master Feb 9, 2024
4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.30 milestone Feb 9, 2024
@Jeffwan Jeffwan deleted the jiaxin/4176 branch February 10, 2024 00:53