
KEP-4176: A Static Policy Option to spread hyperthreads across physical CPUs #4177

Merged
merged 16 commits into from
Feb 9, 2024

Conversation

Contributor

@Jeffwan Jeffwan commented Sep 5, 2023

  • One-line PR description: This is a KEP PR to add a new static policy in cpu manager.
  • Other comments:

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. labels Sep 5, 2023
@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Sep 5, 2023
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Sep 5, 2023
Co-authored-by: Lingyan Yin <[email protected]>
Co-authored-by: Zewei Ding <[email protected]>
Co-authored-by: Shengjie Xue <[email protected]>
Contributor

@ffromani ffromani left a comment

initial review

Comment on lines 224 to 225
We propose to add a new `CPUManager` policy option called `spread-physical-cpus-preferred` to the static CPUManager policy. When enabled, this will trigger the CPUManager to try to spread CPUs across physical cores as much as possible.

Contributor

perhaps you already answered this later in the doc, but let me ask anyway:
ignoring the admittedly pretty ugly UX of this solution, if you need to run a pod with the aforementioned requirements and

  • assuming cpumanager static policy in effect
  • enable full-pcpus-only policy option
  • double the CPU requirement (e.g. the workload actually needs say 10, you ask 20)
  • assuming the pod would be QoS guaranteed of course

would this actually work?
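For illustration, here is what the workaround described above might look like as a Guaranteed-QoS pod built with the Kubernetes Go API types; the workload notionally needs 10 CPUs but requests 20 so that, with the static policy and full-pcpus-only, it is granted full physical cores with no shared siblings. All names and values are hypothetical.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Hypothetical pod: requests == limits (Guaranteed QoS) and the CPU
	// request is doubled from 10 to 20 to over-provision full physical cores.
	pod := &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "overprovisioned-workload"},
		Spec: v1.PodSpec{
			Containers: []v1.Container{{
				Name:  "app",
				Image: "example.com/app:latest",
				Resources: v1.ResourceRequirements{
					Requests: v1.ResourceList{
						v1.ResourceCPU:    resource.MustParse("20"),
						v1.ResourceMemory: resource.MustParse("8Gi"),
					},
					Limits: v1.ResourceList{
						v1.ResourceCPU:    resource.MustParse("20"),
						v1.ResourceMemory: resource.MustParse("8Gi"),
					},
				},
			}},
		},
	}
	fmt.Println(pod.Name)
}
```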

Contributor Author

@Jeffwan Jeffwan Sep 12, 2023

Do you mean making it work with in-place pod update? If so, it currently doesn't work. We made some changes downstream to make the CPU manager work with the in-place pod update feature. Please check https://docs.google.com/document/d/1V3DLh3pH3CD-xhhJvAnOq_oWgPyjO-vj6wY6qdew9H0/edit#heading=h.ybybfdfputt, paragraph 2, for more details.

This proposal concentrates more on the CPU allocation policy, since full-pcpus-only allocates whole physical CPUs, which is not what we want in our scenarios.

Contributor

I meant something simple and brutal: overspecifying the pod requirements in the pod spec before submitting to the cluster. I'm just trying to fully grasp the gaps in the existing options.

Contributor Author

@Jeffwan Jeffwan Sep 12, 2023

I see. It really depends on the node capacity. Let's say the node has 48 vCPUs. If you use full-pcpus-only, it will use 10 physical CPUs. If you use the newly proposed solution, it will use 20 physical CPUs (each physical CPU has only one vCPU in use). If the node only has 20 vCPUs, it makes no difference in the end and all vCPUs are occupied.

Contributor

The UX is admittedly bad, and this is bad for pod density. But since noisy neighbors were mentioned in the KEP, I think we should explore this (again, suboptimal) path, and most notably the interaction with other options, like full-pcpus-only.

Contributor

We need to explore and document how this option interacts with other existing options. Just to be clear, IMO making this option incompatible with others is a possibility, but we should explore the option space and try to make the feature composable; incompatibility should be the last resort. I'm inclined to say this is an alpha blocker, but we can talk about it. It will surely be a beta blocker.

Contributor

This comment ^^^ is still relevant; we need to spell out how this option composes with the others and what we do if incompatible options are enabled at the same time. Probably the kubelet should fail to start?
In general it will be better to make the options compose; this has pretty much been the expectation up until now.
If the options cannot compose, it is worth explaining why in the KEP text (perhaps in the design details section).
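As a minimal sketch of the fail-fast behaviour suggested above, assuming hypothetical option names and a standalone helper (the real kubelet option validation may differ):

```go
package cpumanager

import "fmt"

// Illustrative option names; the real constants live in the kubelet.
const (
	fullPCPUsOnly            = "full-pcpus-only"
	distributeCPUsAcrossNUMA = "distribute-cpus-across-numa"
	spreadPhysicalCPUs       = "spread-physical-cpus-preferred"
)

// validateOptionCompatibility sketches a startup check: if the new option is
// combined with an option it cannot compose with, return a configuration
// error so the kubelet can refuse to start.
func validateOptionCompatibility(opts map[string]bool) error {
	if opts[spreadPhysicalCPUs] && opts[fullPCPUsOnly] {
		return fmt.Errorf("policy options %q and %q cannot be enabled together", spreadPhysicalCPUs, fullPCPUsOnly)
	}
	if opts[spreadPhysicalCPUs] && opts[distributeCPUsAcrossNUMA] {
		return fmt.Errorf("policy options %q and %q cannot be enabled together", spreadPhysicalCPUs, distributeCPUsAcrossNUMA)
	}
	return nil
}
```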


The risk associated with implementing this new proposal is minimal. It pertains only to a distinct policy option within the `CPUManager`, guarded by the option's own opt-in mechanism in addition to the `CPUManagerPolicyAlphaOptions` feature gate, which is disabled by default.
Contributor

in case you add a new policy, however, you will need a new feature gate. Everything else still holds true

Contributor Author

Thanks for the reminder, I didn't explain it clearly. It's a new option instead of a new policy, so the existing feature gates are good enough.


We verify correctness by checking the kubelet log and the CPU allocation of the workload. We have not added any metrics for this new feature.
Contributor

we can probably discuss extending the existing cpumanager metrics

Contributor Author

That's a good idea. This is indeed something we want to monitor to give users more confidence.
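As a rough sketch of what such a metric could look like (the metric name and labels are hypothetical, and the kubelet would use its own metrics framework rather than the raw Prometheus client):

```go
package cpumanager

import "github.com/prometheus/client_golang/prometheus"

// cpuSpreadAllocationsTotal is a hypothetical counter for exclusive CPU
// allocations performed while spread-physical-cpus-preferred is enabled.
var cpuSpreadAllocationsTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "cpu_manager_spread_physical_cpus_allocations_total",
		Help: "Number of exclusive CPU allocations made with the spread-physical-cpus-preferred option enabled.",
	},
	[]string{"result"}, // e.g. "success" or "error"
)

func init() {
	prometheus.MustRegister(cpuSpreadAllocationsTotal)
}
```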

Contributor

I'm (slowly) working on a PR to expose events for observability, but it's better not to make this KEP depend on that. We can perhaps integrate later at beta graduation.

@Jeffwan Jeffwan changed the title KEP-4176: Static Policy to spread hyperthreads across physical CPUs KEP-4176: A Static Policy Option to spread hyperthreads across physical CPUs Sep 12, 2023
@swatisehgal
Contributor

/cc

@pacoxu
Member

pacoxu commented Sep 14, 2023

/cc @saschagrunert
/cc

@ffromani
Contributor

/cc @kad

I'll review again ASAP. From my PoV, the main outstanding item is to clarify how this proposed option interacts with existing options, in particular with full-pcpus-only. Likely other options are relevant here, will add details in my next review.

@k8s-ci-robot k8s-ci-robot requested a review from kad September 21, 2023 10:04
@ffromani
Contributor

[...] the main outstanding item is to clarify how this proposed option interacts with existing options, in particular with full-pcpus-only. Likely other options are relevant here, will add details in my next review.

@Jeffwan for awareness, I'm not sure whether you have already covered this in the KEP or plan to. I think the proposal needs to explain how this new option and

  • full-pcpus-only
  • distribute-cpus-across-numa

work together, and whether or not they are compatible with each other. Having incompatible options is a possibility, but the reason for the incompatibility needs to be clearly documented in the design proposal. Incompatibility could lead to bad UX and can be a design aspect to iterate over.

Contributor

@ffromani ffromani left a comment

revamping review as the KEP freeze deadline is approaching


@kad
Member

kad commented Oct 3, 2023

... to clarify how this proposed option interacts with existing options, in particular with full-pcpus-only. Likely other options are relevant here, will add details in my next review.

IMHO, we potentially have two possibilities if both options are specified:

  1. error as misconfiguration
  2. do something implicit: e.g., if the node is hyperthreaded and requests.cpu % num_hyperthreads == 0, then prefer full physical cores, else spread across physical cores.

I'd be more inclined to option 1. If the behaviour is not clearly predictable, it's better to mark it as a configuration problem and let the node owner select explicit node options (probably with explicit labels), so workloads will be able to prefer one combination of options over another.
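A minimal sketch of the implicit behaviour described in option 2, with purely illustrative names (not proposed kubelet code):

```go
package cpumanager

// chooseAllocationStrategy: on a hyperthreaded node, prefer full physical
// cores when the request is an exact multiple of the threads per core;
// otherwise spread the allocation across physical cores.
func chooseAllocationStrategy(requestedCPUs, threadsPerCore int, hyperthreaded bool) string {
	if !hyperthreaded {
		return "default"
	}
	if requestedCPUs%threadsPerCore == 0 {
		return "prefer-full-physical-cores" // behaves like full-pcpus-only
	}
	return "spread-across-physical-cores" // behaves like the proposed option
}
```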

@ffromani
Contributor

ffromani commented Oct 3, 2023

... to clarify how this proposed option interacts with existing options, in particular with full-pcpus-only. Likely other options are relevant here, will add details in my next review.

IMHO, we potentially have two possibilities if both options are specified:

1. error as misconfiguration

2. do something implicit: e.g., if the node is hyperthreaded and requests.cpu % num_hyperthreads == 0, then prefer full physical cores, else spread across physical cores.

I'd be more inclined to option 1. If the behaviour is not clearly predictable, it's better to mark it as a configuration problem and let the node owner select explicit node options (probably with explicit labels), so workloads will be able to prefer one combination of options over another.

That would work for me both in general and as a specific approach. Having some form of discussion in the KEP about how the options interact (or not) and why is, IMO, an alpha blocker; further refinement, if needed/desired, can be deferred.

@Jeffwan
Contributor Author

Jeffwan commented Oct 3, 2023

@ffromani Due to a recent release, I was quite swamped. However, I have more availability this month and will address the mentioned comments today

@Jeffwan
Contributor Author

Jeffwan commented Oct 3, 2023

/cc @LingyanYin

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 9, 2024
@Jeffwan
Contributor Author

Jeffwan commented Feb 9, 2024

@klueska I accepted the change and I think that makes it clearer.

Contributor

@klueska klueska left a comment

I went through the whole KEP and am inclined to approve it, assuming you make the suggested changes. The major change is making sure that the terminology used is consistent with the terms used by Kubernetes for cores/CPUs (even if different terms are typically used outside of k8s).

In general, the change is small, self-contained, and fairly well understood. I see very little risk with moving forward with this proposal.

The current default sorting order is `sockets`, `cores`, and then `cpus`. Using a machine with a 2-socket, 6-core, 12-CPU topology as an example, the default CPU ordering is [0, 6, 2, 8, 4, 10 | 1, 7, 3, 9, 5, 11]. In that case, if the CPU manager plans to allocate two CPUs, [0, 6] will be picked. However, they belong to the same socket, the same NUMA node, and the same physical core, which does not meet our case.


In order to meet our use case, we can change the sorting algorithm to sort by `socket` and then directly by `cpus`, without taking physical cores into the ordering. In that case, we get the CPU sequence [0, 2, 4, 6, 8, 10 | 1, 3, 5, 7, 9, 11]. From the topology information, we know [0, 2] will be allocated for a 2-CPU container, and 0 and 2 are from the same socket but different physical cores.
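A self-contained sketch of the reordering described above, using the example 2-socket, 6-core, 12-CPU topology; the types and sorting here are illustrative and not the kubelet's actual topology code:

```go
package main

import (
	"fmt"
	"sort"
)

// cpuInfo carries just the fields the example needs; in the kubelet the
// equivalent data comes from the discovered CPU topology.
type cpuInfo struct {
	id, core, socket int
}

func ids(cs []cpuInfo) []int {
	out := make([]int, len(cs))
	for i, c := range cs {
		out[i] = c.id
	}
	return out
}

func main() {
	// CPUs 0/6, 2/8, 4/10 are hyperthread siblings on socket 0;
	// CPUs 1/7, 3/9, 5/11 are siblings on socket 1.
	cpus := []cpuInfo{
		{0, 0, 0}, {6, 0, 0}, {2, 1, 0}, {8, 1, 0}, {4, 2, 0}, {10, 2, 0},
		{1, 3, 1}, {7, 3, 1}, {3, 4, 1}, {9, 4, 1}, {5, 5, 1}, {11, 5, 1},
	}

	// Default ordering: socket, then core, then CPU id.
	defaultOrder := append([]cpuInfo(nil), cpus...)
	sort.Slice(defaultOrder, func(i, j int) bool {
		a, b := defaultOrder[i], defaultOrder[j]
		if a.socket != b.socket {
			return a.socket < b.socket
		}
		if a.core != b.core {
			return a.core < b.core
		}
		return a.id < b.id
	})

	// Proposed ordering: socket, then CPU id, ignoring physical cores,
	// so hyperthread siblings are no longer adjacent.
	spreadOrder := append([]cpuInfo(nil), cpus...)
	sort.Slice(spreadOrder, func(i, j int) bool {
		a, b := spreadOrder[i], spreadOrder[j]
		if a.socket != b.socket {
			return a.socket < b.socket
		}
		return a.id < b.id
	})

	fmt.Println(ids(defaultOrder)) // [0 6 2 8 4 10 1 7 3 9 5 11]
	fmt.Println(ids(spreadOrder))  // [0 2 4 6 8 10 1 3 5 7 9 11]
}
```

Allocating the first two entries of each ordering reproduces the example: [0, 6] (same physical core) with the default order, [0, 2] (different physical cores) with the spread order.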
Contributor

@klueska klueska Feb 9, 2024

I think this is sufficient, but we should make sure it is true in all cases. In any case, we can iterate on this during implementation regardless.

Contributor Author

Got it. We will at least cover the scenarios we've seen so far to make sure it has good coverage.

The failure modes are similar to those of the existing options. This option changes the way the CPU manager allocates CPUs.
It is compatible when users switch between options; however, when a pod gets rescheduled, it will follow the currently configured static option instead of the previous one.

Currently, in the alpha version, we consider it incompatible with other options, and users should stick to this option. Compatibility issues will be resolved in a future version.
Contributor

We should try to make it as compatible as possible with other options in this release. I think composability of the various policy options is one of the nice things about the policy options framework. It is obviously incompatible with the distribute-cpus-across-numa option, but I think there is probably a logical way to make it work with the full-pcpus-only option (e.g. by reducing the overall number of available cores by half). This would obviously mean that only half the cores are available for allocation, but then you would be able to guarantee that you don't have any noisy neighbors, in addition to being able to take advantage of the L2 cache.
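A minimal sketch of the composition idea above, assuming an illustrative helper and a CPU-to-core map (not actual kubelet code): keep only one hyperthread per physical core in the allocatable set, halving capacity but guaranteeing that no sibling thread is ever handed to another container.

```go
package cpumanager

// onePerPhysicalCore returns at most one CPU per physical core. coreOf maps
// a CPU ID to its physical core ID.
func onePerPhysicalCore(cpus []int, coreOf map[int]int) []int {
	seen := map[int]bool{}
	var out []int
	for _, cpu := range cpus {
		core := coreOf[cpu]
		if seen[core] {
			continue // skip the sibling hyperthread
		}
		seen[core] = true
		out = append(out, cpu)
	}
	return out
}
```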

- "@LastNight1997"
owning-sig: sig-node
participating-sigs: []
status: provisional
Contributor

Is this right?

Contributor

no, it should be implementable

Contributor Author

Sounds good. I changed it to implementable

feature-gates:
  - name: "CPUManagerPolicyAlphaOptions"
    components:
      - kubelet
Contributor

Does the kube-apiserver validate the names of CPU manager policy options? If so, this gate is relevant there too.

Contributor

kubelet does this validation atm, I don't think this will change

Contributor


### Risks and Mitigations
Contributor

(How) will we drive this new policy from alpha to general availability?

Contributor

@klueska klueska Feb 9, 2024

It will move from being a CPUManagerPolicyAlphaOption, to a CPUManagerPolicyBetaOption, to eventually having no feature gate protecting it at all.

Here's an example of one such option moving to beta last release (albeit for a similar mechanism in the TopologyManager rather than the CPUManager):

@Jeffwan
Contributor Author

Jeffwan commented Feb 9, 2024

I accepted most of the changes proposed by @klueska on the consistency of the terms used throughout the k8s codebase and also updated the status to implementable. Please have another look @ffromani @klueska @sftim @mrunalp

@alculquicondor
Member

It looks like I was tagged on this at some point... are there any scheduler implications?

@klueska
Contributor

klueska commented Feb 9, 2024

@alculquicondor no. I'm not sure why you would have been tagged.

@klueska
Contributor

klueska commented Feb 9, 2024

Thanks @Jeffwan for your contribution. This new policy option is straightforward and self-contained and will be a nice addition to complement the policy options that already exist for the CPUManager.

Please file an exception request for this feature, as per the instructions here:
https://github.com/kubernetes/sig-release/blob/master/releases/EXCEPTIONS.md

Indicate that the KEP approval was delayed due to reviewer bandwidth, but has already been determined to be acceptable and is in the process of being approved now.

/lgtm
/approve

/assign @mrunalp

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 9, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Jeffwan, jpbetz, klueska, mrunalp

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 9, 2024
@k8s-ci-robot k8s-ci-robot merged commit 634faf2 into kubernetes:master Feb 9, 2024
4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.30 milestone Feb 9, 2024
@Jeffwan Jeffwan deleted the jiaxin/4176 branch February 10, 2024 00:53