KEP-4176: A Static Policy Option to spread hyperthreads across physical CPUs #4177
Conversation
Jeffwan commented Sep 5, 2023
- One-line PR description: This is a KEP PR to add a new static policy option to the CPU Manager.
- Issue link: New Static Policy Proposal - Spread HT across physical CPUs #4176
- Other comments:
Co-authored-by: Lingyan Yin <[email protected]>
Co-authored-by: Zewei Ding <[email protected]>
Co-authored-by: Shengjie Xue <[email protected]>
initial review
We propose to add a new `CPUManager` policy option called `spread-physical-cpus-preferred` to the static CPUManager policy. When enabled, this will trigger the CPUManager to try to spread the allocated CPUs across as many physical CPUs as possible.
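For context, a static-policy option of this kind is enabled through the kubelet configuration together with the policy-options feature gates. A minimal sketch, assuming the option lands under the name proposed in this KEP (the exact name and graduation level may change):

```yaml
# Sketch of a KubeletConfiguration enabling the proposed option.
# "spread-physical-cpus-preferred" is the name proposed in this KEP, not a released option.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  CPUManagerPolicyOptions: true
  CPUManagerPolicyAlphaOptions: true   # new policy options start out alpha-gated
cpuManagerPolicy: static
cpuManagerPolicyOptions:
  spread-physical-cpus-preferred: "true"
reservedSystemCPUs: "0,1"              # the static policy requires some reserved CPUs
```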
Perhaps you already answer this later in the doc, but let me still ask: ignoring the admittedly pretty ugly UX of this solution, if you need to run a pod with the aforementioned requirements and you
- assume the cpumanager static policy is in effect
- enable the `full-pcpus-only` policy option
- double the CPU requirement (e.g. the workload actually needs say 10, you ask for 20)
- assume the pod would be QoS Guaranteed, of course

would this actually work?
You mean make it work with in-place pod update? If so, currently it doesn't work. We made some changes downstream to make the cpu manager work with the in-place pod update feature. Please check https://docs.google.com/document/d/1V3DLh3pH3CD-xhhJvAnOq_oWgPyjO-vj6wY6qdew9H0/edit#heading=h.ybybfdfputt paragraph 2 for more details.
This proposal concentrates more on the CPU allocation policy, since `full-pcpus-only` allocates whole physical CPUs, which is not what we want in our scenarios.
I meant something simple and brutal: overspecifying the pod requirements in the pod spec before submitting it to the cluster. I'm just trying to fully grasp the gaps in the existing options.
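A minimal sketch of what such an overspecified pod could look like, assuming the workload really needs only 10 CPUs; the pod name and image are hypothetical, and requests must equal limits for the pod to land in the Guaranteed QoS class:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: overspecified-workload          # hypothetical name
spec:
  containers:
  - name: app
    image: registry.example.com/app     # placeholder image
    resources:
      requests:
        cpu: "20"                        # doubled: the workload only needs 10
        memory: "8Gi"
      limits:
        cpu: "20"                        # requests == limits => Guaranteed QoS
        memory: "8Gi"
```

With the static policy and `full-pcpus-only`, a request like this would pin the container to 10 full physical cores, so the unused sibling threads cannot be handed to other workloads, at the cost of pod density.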
I see. It really depends on the node capacity. Let's say the node has 48 vCPUs. If you use `full-pcpus-only`, it will use 10 physical CPUs. If you use the newly proposed solution, it will use 20 physical CPUs (each physical CPU with only one vCPU in use). If the node only has 20 vCPUs, it makes no difference in the end and all vCPUs are occupied.
The UX is admittedly bad and this is bad for pod density. But since noisy neighbours were mentioned in the KEP, I think we should explore this (again, suboptimal) path and most notably its interaction with other options, like `full-pcpus-only`.
We need to explore and document how this option interacts with other existing options. Just to be clear, IMO making this option incompatible with others is a possibility, but we should explore the option and try to make the feature composable; incompatibility should be the last resort. I'm inclined to say this is an alpha blocker, but we can talk about it. It is surely going to be a beta blocker.
This comment ^^^ is still relevant; we need to spell out how this option composes with the others and what we do if incompatible options are enabled at the same time. Probably the kubelet should fail to start?
In general it will be better to make the options compose - this is pretty much the expectation up until now.
If the options cannot compose, it is worth explaining why in the KEP text (perhaps in the design details section).
The risk associated with implementing this new proposal is minimal. It pertains only to a distinct policy option within the `CPUManager` and is safeguarded by the option's inherent security measures, in addition to the default deactivation of the `CPUManagerPolicyAlphaOptions` feature gate.
In case you add a new policy, however, you will need a new feature gate. Everything else still holds true.
Thanks for the reminder, I didn't explain it clearly. It's a new option instead of a new policy, so the existing feature gates are good enough.
I verify the correctness by checking the kubelet log and the CPU allocation of the workload. I have not added any metrics for this new feature.
We can probably think about and discuss extending the existing cpumanager metrics.
That's a good idea. This is indeed something we want to monitor to give users more confidence.
I'm (slowly) working on a PR to expose events for observability, but better not to make this KEP depend on that. We can perhaps integrate later at beta graduation.
/cc
/cc @saschagrunert
/cc @kad
I'll review again ASAP. From my PoV, the main outstanding item is to clarify how this proposed option interacts with existing options, in particular with `full-pcpus-only`.
@Jeffwan for awareness, not sure you covered this already in the KEP or plan to. I think the proposal needs to explain how this new option and `full-pcpus-only` work together, and whether they are compatible with each other or not. Having incompatible options is a possibility, but the reason for the incompatibility needs to be clearly documented in the design proposal. Incompatibility could lead to bad UX and can be a design aspect to iterate over.
Resuming the review as the KEP freeze deadline is approaching.
IMHO, we potentially have two possibilities if both options are specified:
I'd be more inclined to option 1. If the behaviour is not clearly predictable, it is better to mark it as a configuration problem and let the node owner select explicit node options, probably with explicit labels, so that workloads will be able to prefer one combination of options over another.
That would work for me, both in general and as a specific approach. Having some form of discussion in the KEP about how the options interact (or not) and why is, IMO, an alpha blocker; further refinement, if needed/desired, can be deferred.
@ffromani Due to a recent release, I was quite swamped. However, I have more availability this month and will address the mentioned comments today.
/cc @LingyanYin
…DME.md Co-authored-by: Kevin Klues <[email protected]>
@klueska I accepted the change and I think that makes it clearer.
I went through the whole KEP and am inclined to approve it, assuming you make the changes suggested. The major change being to make sure that the terminology used is consistent with the terms used by Kubernetes for cores/CPUs (even if different terms are typically used outside of k8s).
In general, the change is small, self-contained, and fairly well understood. I see very little risk in moving forward with this proposal.
The current default sorting order is `sockets`, `cores` and then `cpus`. Using a machine with a 2-socket, 6-core, 12-CPU topology as an example, the default CPU ordering is [0, 6, 2, 8, 4, 10 | 1, 7, 3, 9, 5, 11]. In that case, if the CPU manager plans to allocate two CPUs, [0, 6] will be picked. However, they belong to the same socket, the same NUMA node and the same physical core, which cannot satisfy our use case.

In order to meet our use case, we can change the sorting algorithm to sort by `socket` and then directly by `cpus`, without taking physical cores into the ordering. In that case, we get the CPU sequence [0, 2, 4, 6, 8, 10 | 1, 3, 5, 7, 9, 11]. From the topology information, we know [0, 2] will be allocated for a 2-CPU container, and 0 and 2 are on the same socket but different physical cores.
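A simplified sketch of the ordering change described above, using the same 2-socket / 6-core / 12-CPU topology; the `cpuInfo` type and the socket/core layout are illustrative only and do not mirror the actual kubelet topology code:

```go
package main

import (
	"fmt"
	"sort"
)

// cpuInfo is a hypothetical, simplified view of one logical CPU.
type cpuInfo struct {
	ID       int
	SocketID int
	CoreID   int
}

func ids(cpus []cpuInfo) []int {
	out := make([]int, len(cpus))
	for i, c := range cpus {
		out[i] = c.ID
	}
	return out
}

func main() {
	// 2 sockets, 6 physical cores, 12 logical CPUs, as in the example above:
	// socket = id % 2, physical core = id % 6.
	var cpus []cpuInfo
	for id := 0; id < 12; id++ {
		cpus = append(cpus, cpuInfo{ID: id, SocketID: id % 2, CoreID: id % 6})
	}

	// Default static-policy ordering: socket, then physical core, then CPU ID.
	defaultOrder := append([]cpuInfo(nil), cpus...)
	sort.Slice(defaultOrder, func(i, j int) bool {
		a, b := defaultOrder[i], defaultOrder[j]
		if a.SocketID != b.SocketID {
			return a.SocketID < b.SocketID
		}
		if a.CoreID != b.CoreID {
			return a.CoreID < b.CoreID
		}
		return a.ID < b.ID
	})

	// Proposed "spread" ordering: socket, then CPU ID, ignoring physical cores.
	spreadOrder := append([]cpuInfo(nil), cpus...)
	sort.Slice(spreadOrder, func(i, j int) bool {
		a, b := spreadOrder[i], spreadOrder[j]
		if a.SocketID != b.SocketID {
			return a.SocketID < b.SocketID
		}
		return a.ID < b.ID
	})

	fmt.Println("default:", ids(defaultOrder)) // [0 6 2 8 4 10 1 7 3 9 5 11]
	fmt.Println("spread: ", ids(spreadOrder))  // [0 2 4 6 8 10 1 3 5 7 9 11]
}
```

Taking the first two entries of the spread ordering gives [0 2]: same socket, different physical cores, which is exactly the allocation the KEP wants for a 2-CPU container.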
I think this is sufficient, but we should make sure it is true in all cases. In any case, we can iterate on this during implementation regardless.
Got it. We will at least cover enough of the scenarios we've seen to make sure it has good coverage.
The failure modes are similar to the existing options. It changes the way the CPU manager allocates CPUs.
It's compatible when the user switches between options; however, when a pod gets rescheduled, it will follow the current static option instead of the previous one.

Currently, in the alpha version, we consider it incompatible with other options, and users should stick to this option. The compatibility issue would be resolved in a future version.
We should try to make it as compatible as possible with other options in this release. I think composability of the various policy options is one of the nice things about the policy options framework. It is obviously incompatible with the `distribute-cpus-across-numa` option, but I think there is probably a logical way to make it work with the `full-pcpus-only` option (e.g. by reducing the overall number of available cores by half). This would obviously mean that only half the cores are available for allocation, but then you would be able to guarantee that you don't have any noisy neighbors, in addition to being able to take advantage of the L2 cache.
- "@LastNight1997" | ||
owning-sig: sig-node | ||
participating-sigs: [] | ||
status: provisional |
Is this right?
no, it should be implementable
Sounds good. I changed it to implementable
feature-gates:
  - name: "CPUManagerPolicyAlphaOptions"
    components:
      - kubelet
Does the kube-apiserver validate the names of CPU manager policy options? If so, this gate is relevant there too.
The kubelet does this validation at the moment; I don't think this will change.
It does not as far as I know.
Here are three KEPs similar to this one that introduce new policy options in the same way:
### Risks and Mitigations
(How) will we drive this new policy from alpha to general availability?
It will move from being a `CPUManagerPolicyAlphaOption` to a `CPUManagerPolicyBetaOption` to not having a feature gate protecting it at all.
Here's an example of one such option moving to beta last release (albeit for a similar mechanism in the TopologyManager rather than the CPUManager):
It looks like I was tagged on this at some point... are there any scheduler implications?
@alculquicondor no. I'm not sure why you would have been tagged.
Thanks @Jeffwan for your contribution. This new policy option is straightforward and self-contained, and will be a nice addition to complement the existing policy options that already exist for the CPUManager. Please file an exception request for this feature, as per the instructions here: Indicate that the KEP approval was delayed due to reviewer bandwidth, but has already been determined to be acceptable and is in the process of being approved now.
/lgtm
/assign @mrunalp
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: Jeffwan, jpbetz, klueska, mrunalp
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.