From ca6bc7f792fae8858cb638866d33eb0cb7a02073 Mon Sep 17 00:00:00 2001 From: Jiaxin Shan Date: Mon, 4 Sep 2023 22:48:24 -0700 Subject: [PATCH 01/16] KEP-4176: Static Policy to spread hyperthreads across physical CPUs Co-authored-by: Lingyan Yin Co-authored-by: Zewei Ding Co-authored-by: Shengjie Xue <3150104939@zju.edu.cn> --- .../README.md | 879 ++++++++++++++++++ .../cpu-ordering.png | Bin 0 -> 16996 bytes .../kep.yaml | 42 + 3 files changed, 921 insertions(+) create mode 100644 keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/README.md create mode 100644 keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/cpu-ordering.png create mode 100644 keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/kep.yaml diff --git a/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/README.md b/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/README.md new file mode 100644 index 00000000000..eb7a955ac4d --- /dev/null +++ b/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/README.md @@ -0,0 +1,879 @@ + + +# KEP-4176: A new Static Policy to spread Hyperthreads across physical CPUs to better utilize CPU Cache + + + + + + +- [Release Signoff Checklist](#release-signoff-checklist) +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [User Stories (Optional)](#user-stories-optional) + - [Story 1 Bytedance Database Performance Optimization](#story-1-bytedance-database-performance-optimization) + - [Story 2](#story-2) + - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) + - [Risks and Mitigations](#risks-and-mitigations) +- [Design Details](#design-details) + - [Test Plan](#test-plan) + - [Prerequisite testing updates](#prerequisite-testing-updates) + - [Unit tests](#unit-tests) + - [Integration tests](#integration-tests) + - [e2e tests](#e2e-tests) + - [Graduation Criteria](#graduation-criteria) + - [Upgrade / Downgrade 
Strategy](#upgrade--downgrade-strategy) + - [Version Skew Strategy](#version-skew-strategy) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) +- [Infrastructure Needed (Optional)](#infrastructure-needed-optional) + + +## Release Signoff Checklist + + + +Items marked with (R) are required *prior to targeting to a milestone / release*. + +- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) +- [ ] (R) KEP approvers have approved the KEP status as `implementable` +- [ ] (R) Design details are appropriately documented +- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) + - [ ] e2e Tests for all Beta API Operations (endpoints) + - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) + - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free +- [ ] (R) Graduation criteria is in place + - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) +- [ ] (R) Production readiness review completed +- [ ] (R) Production readiness review approved +- [ ] "Implementation History" section is up-to-date for milestone +- [ ] User-facing documentation has been created in 
[kubernetes/website], for publication to [kubernetes.io] +- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes + + + +[kubernetes.io]: https://kubernetes.io/ +[kubernetes/enhancements]: https://git.k8s.io/enhancements +[kubernetes/kubernetes]: https://git.k8s.io/kubernetes +[kubernetes/website]: https://git.k8s.io/website + +## Summary + + + +We propose this KEP to introduce a new CPU Manager Static Policy that spreads hyper threads across physical cores. In this policy, we changed the cpu assignment sorting algorithm to sort by socket and then directly cpus without taking physical cores into ordering. It seems like a noisy neighbour issue, but it's always true. We will explain the reason in the motivation section. This policy is useful for some applications which need to take advantage of CPU Cache. + +## Motivation + + + +`full-pcpus-only` was introduced to resolve the problem of different containers sharing the same physical cores, which leads to the noisy neighbor issue. `distribute-cpus-across-numa` distributes CPUs across NUMA nodes to make sure no single worker suffers from NUMA effects more than any other, improving the overall performance of these types of applications. Neither these two options nor the default behavior can meet the requirements of some applications which need to take advantage of the L2 cache. For those applications, we need to spread hyper threads across physical cores, while NUMA is not an important factor. The assumption is that a single physical core shared by two applications won't always be busy, so a busy application can use more of the CPU cache and perform better. When that assumption does not hold, the noisy neighbor issue remains. + + +### Goals + + +- Introduce a new CPU Manager Static Policy that spreads hyper threads across physical cores without considering NUMA. +- Enhance application performance by taking advantage of L2 Cache. 
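
The ordering change behind these goals can be illustrated with a small, self-contained sketch. It is not the kubelet implementation: the `LogicalCPU` type and the helper names are hypothetical, the core numbering (lowest CPU ID on the core) is an assumption, and NUMA nodes and already-allocated CPUs are ignored. It uses the example topology from the Design Details section (2 sockets, 6 physical cores, 12 logical CPUs).

```python
from typing import NamedTuple


class LogicalCPU(NamedTuple):
    cpu_id: int     # logical CPU (hyper thread) ID
    core_id: int    # physical core ID; assumed to be the lowest CPU ID on that core
    socket_id: int  # socket the core belongs to


def example_topology() -> list[LogicalCPU]:
    """Hypothetical machine: CPU n sits on socket n % 2 and is the sibling of CPU n + 6."""
    return [LogicalCPU(cpu, cpu % 6, cpu % 2) for cpu in range(12)]


def packed_order(cpus: list[LogicalCPU]) -> list[int]:
    """Default-style ordering: by socket, then physical core, then CPU ID."""
    ordered = sorted(cpus, key=lambda c: (c.socket_id, c.core_id, c.cpu_id))
    return [c.cpu_id for c in ordered]


def spread_order(cpus: list[LogicalCPU]) -> list[int]:
    """Proposed ordering: by socket, then directly by CPU ID, ignoring cores."""
    ordered = sorted(cpus, key=lambda c: (c.socket_id, c.cpu_id))
    return [c.cpu_id for c in ordered]


def allocate(order: list[int], count: int) -> list[int]:
    """Take the first `count` CPUs from an ordering (all CPUs assumed free)."""
    return order[:count]


if __name__ == "__main__":
    topology = example_topology()
    print("packed:", allocate(packed_order(topology), 2))
    print("spread:", allocate(spread_order(topology), 2))
```

With this topology, a 2-CPU request takes both hyper threads of one core under the packed ordering and one hyper thread from each of two cores under the spread ordering.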
+ +### Non-Goals + + + +- This proposal does not aim to modify the existing CPU Manager Core Binding Policies. It focuses solely on introducing a new policy for spreading hyper threads across physical cores. +- It does not address other resource allocation or management aspects within Kubernetes. + + +## Proposal + + + +We propose to add a new `CPUManager` policy option called `spread-physical-cpus-preferred` to the static CPUManager policy. When enabled, this will trigger the CPUManager to try to allocate CPUs across physical cores as much as possible. + +### User Stories (Optional) + + + +#### Story 1 Bytedance Database Performance Optimization + +We run DB instances in Kubernetes and have used the default static policy so far. We noticed that the performance of the DB instances is not stable. When an instance is under pressure, the default behavior has allocated it two CPUs from the same physical core. However, an important pattern we observed is that the instances are not all busy at the same time. After investigation, we found that the CPU cache is one bottleneck: once we spread hyper threads across physical cores, the busy instance can use more CPU cache and its performance improves considerably. + +#### Story 2 + +### Notes/Constraints/Caveats (Optional) + + + +### Risks and Mitigations + + + +The risk associated with implementing this proposal is minimal. It affects only a distinct policy option within the `CPUManager` and is guarded both by the option itself, which must be set explicitly, and by the `CPUManagerPolicyAlphaOptions` feature gate, which is disabled by default. + +| Risk | Impact | Mitigation | +| -------------------------------------------------| -------| ---------- | +| Bugs in the implementation lead to kubelet crash | High | Disable the policy option and restart the kubelet. The workload will run but with CPU packing semantics - like it was before this new policy option was added. 
| + + +## Design Details + + + +The current default sorting order is `sockets`, then `cores`, then `cpus`. Taking a machine with 2 sockets, 6 physical cores and 12 logical CPUs as an example, the default CPU ordering is [0, 6, 2, 8, 4, 10 | 1, 7, 3, 9, 5, 11]. If the CPU manager needs to allocate two CPUs, it picks [0, 6]. However, these belong to the same socket, the same NUMA node and the same physical core, which does not fit our use case. + + +To meet our use case, we can change the sorting algorithm to sort by `socket` and then directly by `cpus`, leaving physical cores out of the ordering. This yields the CPU sequence [0, 2, 4, 6, 8, 10 | 1, 3, 5, 7, 9, 11]. A container requesting 2 CPUs is then allocated [0, 2]; from the topology information we know that CPUs 0 and 2 are on the same socket but on different physical cores. + + +![](./cpu-ordering.png) + + + +### Test Plan + + + +[x] I/we understand the owners of the involved components may require updates to +existing tests to make this code solid enough prior to committing the changes necessary +to implement this enhancement. + +##### Prerequisite testing updates + + + +##### Unit tests + + + + + +- ``: `` - `` + +##### Integration tests + + + + + +- : + +##### e2e tests + + + +- : + +### Graduation Criteria + + + +### Upgrade / Downgrade Strategy + + + +We expect no impact. The new policy option is opt-in and operates independently of the existing options. + +### Version Skew Strategy + +No changes needed. + + +## Production Readiness Review Questionnaire + + + +### Feature Enablement and Rollback + + + +###### How can this feature be enabled / disabled in a live cluster? 
+ + + +- [x] Feature gate (also fill in values in `kep.yaml`) + - Feature gate name: `CPUManagerPolicyAlphaOptions` + - Components depending on the feature gate: `kubelet` +- [x] Change the kubelet configuration to set a CPUManager policy of static and a CPUManager policy option of `spread-physical-cpus-preferred` + - Will enabling / disabling the feature require downtime of the control plane? No + - Will enabling / disabling the feature require downtime or reprovisioning of a node? (Do not assume Dynamic Kubelet Config feature is enabled). Yes -- a kubelet restart is required. + +###### Does enabling the feature change any default behavior? + + +No. To enable the feature, the user must explicitly set the `CPUManager` policy to `static`, enable the `CPUManagerPolicyAlphaOptions` feature gate and set the CPUManager policy option `spread-physical-cpus-preferred`. + + +###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? + + + +Yes. The feature can be disabled in any of the following ways: +- Set the `CPUManager` policy to `none` +- Disable `CPUManagerPolicyAlphaOptions` +- Select other policy options instead of `spread-physical-cpus-preferred` + +Running workloads continue without disruption, while new workloads receive CPU allocations according to the policy in effect after the rollback. + +###### What happens if we reenable the feature if it was previously rolled back? + +Re-enabling the feature after a rollback behaves the same as enabling it for the first time. Existing containers retain their allocations, while newly created containers are allocated according to the new option. + +###### Are there any tests for feature enablement/disablement? + +A dedicated e2e test will validate that the default behavior is preserved when the feature gate is turned off or when the feature is unused. These are covered by two distinct test scenarios. + + + +### Rollout, Upgrade and Rollback Planning + + +We only target alpha for this release. 
+ +###### How can a rollout or rollback fail? Can it impact already running workloads? + + + +In the worst case, the new logic panics and the kubelet crashes; the kubelet will then restart and workloads will run with the default policy. Running workloads won't be affected. + +###### What specific metrics should inform a rollback? + + + +Correctness can be verified by checking the kubelet log and the CPU allocation of the workload. No metrics have been added for this feature. + +###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? + + +We tested it manually in our internal environment and it worked. Automated upgrade/rollback tests are worth adding in the future. + +###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? + + +No. + +### Monitoring Requirements + + + +###### How can an operator determine if the feature is in use by workloads? + + + +Inspect the kubelet configuration of a node to check that the feature gate is enabled and that the new policy option is in use. + +###### How can someone using this feature know that it is working for their instance? + + + +- [ ] Events + - Event Reason: +- [ ] API .status + - Condition name: + - Other field: +- [ ] Other (treat as last resort) + - Details: + +###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? + + + +Although the CPU allocation algorithm changes in this case, we do not expect it to cause any performance regression, so we do not define any SLOs for this feature. + + +###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? + + + +- [ ] Metrics + - Metric name: + - [Optional] Aggregation method: + - Components exposing the metric: +- [ ] Other (treat as last resort) + - Details: + +###### Are there any missing metrics that would be useful to have to improve observability of this feature? 
+ + + +N/A + +### Dependencies + + + +N/A + +###### Does this feature depend on any specific services running in the cluster? + + + +This applies to any machines with hyper-threading enabled. + +### Scalability + + + +###### Will enabling / using this feature result in any new API calls? + + +No + +###### Will enabling / using this feature result in introducing new API types? + + +No + +###### Will enabling / using this feature result in any new calls to the cloud provider? + + +No + +###### Will enabling / using this feature result in increasing size or count of the existing API objects? + + +No + +###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? + + +No + +###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? + + +No + +###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? + + +No + +### Troubleshooting + + + +###### How does this feature react if the API server and/or etcd is unavailable? +N/A + +###### What are other known failure modes? +N/A + + +###### What steps should be taken if SLOs are not being met to determine the problem? 
+ +## Implementation History + + + +## Drawbacks + + + +## Alternatives + + + +## Infrastructure Needed (Optional) + + \ No newline at end of file diff --git a/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/cpu-ordering.png b/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/cpu-ordering.png new file mode 100644 index 0000000000000000000000000000000000000000..76bb1e06085620edb53ce2351413fa9e9063fc00 GIT binary patch literal 16996 zcmdtJWmKEr+wTc9P-sgbNO37%D6Yj)iWe_VaVeVMZbe!w1X@aQXbJ8V2rk7^oIr62 zPH?9}6#&m1SiaHQk)w**n=_VJSr=>fble?s<~=_H+EpR|JfV ztjBDTzp>uDvct~trg=(0{2Cha%~Fwq+_*9FJ)sG?*0zv@rLER>VCA>$k{H~^LpEHG zIsF#z7U$jTJ&^5<``T>14VETokUYj+ixBHj4)^n9JeGpnYtFD^GDfU>*5B^i8Jo5w zlhe^*ekO~Ul%=a=b8UR#o5Tt;=PZiPU7kTy!RqFguXIm`~B@iW}9W{ zp*&WBAmiykUs`8`7@;K%hWVpG-jM384EvFD>V|BpF=UNcuU{U*8PXo$%HutKYY4jq z`__;Aq|~b@v{8O#W#j(QvoS^RneoRl#$EDUMFp7>+=Kg1&31|SU-x|e$i{J{C2J~8 zGyVwkYP)kC@1v{}zH0@CsKBe9d|Ujm7sMUpq3YyMT$Y}WhxvSG1aYWuRA~u3)1!I_ z{uN(g`1z!j)K zVN=O`Kagd_!0_e?G4g@jQ|t$UC490Zw8h7~1APqXya;}kUr{&qru+?DVOM-K_<P&)&S##|kq>b!$*o+Y#YBxZ;?fA*WUZU1 zI?K*Kxb{~1SYjK;KM0hRH72Gj|3%8%yY^9@>+ok%2!0!0$b0OZirg?|MjCtxKN|nV z#B6p-mk@w84z`P?Kp6Aq6)7ek=so!bjYt4vu$}SyCVi|2Pj5XVb!q8der&4MlxDgw z86rj$9%k&roft<#D6ayDd&+#J=)vP8ne5Et^y4OYi0B8J>WMYe7pMmxlWgE{0;Sv| zmX9Q1QHVP1-5CF6;68>ZO>nSN>i+t{T-JX0u(FR*odSkwHkIS~*Q+n9l*70uVTmTmf3e&2Noh4zaKM8e9}7P z`vIC-TmK{^kvbC}T`~?DSN!C7q1ikoD+NN@-f9L)OGeyoM{#$aTG>{RPQKXY8^@=d zL@ zy(8x{*pFc}caf7s0w3A3+frazqba(!w$z|kjUgCB@cP;{_WPmAkdP2rbbW;t*7nK! 
zxZor2!H)zWx+1J}@R~B~hX;?7=s5=PrM2$r~8d<(!=aci192XuNOU*6dV(#s= z*#cyg@X5b3UqljEMolsg;*%c8hsNWxJW-YziC2{iSNcTtOETw?VTcY>TD*}p?+7&g z<>1qdU)(F$BVkPn(%&hIK03e^3dqGtY_gvV(Z(tw;VBsFYW$2i9vQA-Y7h45_n9ww zzfBDD9OF*MVXKmlcB%5;helOaay%#RYh#&wUP8u&bg1wv!JS{{GN2fb3G9p#qqALj z?1CGyL?A|#1s!UqSWZ0iYQlEzC|@Ajm%gG2dCkW%N!o-@`%2(x0^bMCR~+n4th6lY zp+j#(7wJAJ{bn^Hm0GAue$~&W&d$SD#=Of$&+bLS0K8OY`7GU&xgs0QvfVdxfejO$AVNXvZfZ74gq^{4MPJAK7U9Iv1SuwQ zb7#CU?~iiUcZQrSc&l-%)@>#>#N3wHim%JO`RB8+`Nipu=7^(@NsrN@SBR$$cD9s` zVskE47L69IpDMK?KTX<|qLksLRwh=#%TCL>-HNgPsu;EwwpR(dY)I8iK@ZC33Ow*o zR5wOYScp(KF%=@P!*}iU4*bBj^%YpOB?hw6LJiyn@#PTC+;68gCG{q;k({3g;M z9<}_ChqFd=Ee-y4DzjU2{*uonMT?3ivJYhP~X>R?cqlviRh>i%Zc-d#fLE;=IchHQ!wH(N}L z3%)D=R#5{cO(qMCf)z&ll1OlRDQu^R`OS9L(Yfv|-CIiPMk`k+$Fo(l6|uuqO4I_f z$dap>3b}l(5BpahKEr#~{v}W6jf3Up`0txa87uv`9r|^VGV9rsgYo5j7)bP|`K#S< zX3{ZsH}+i(<{Z}C9i_|m*rjoo%h4AbV;g~zBF=<+xah3bXf*Rau!LB0V86%U)4}sm zc6Xb18eh%yeD%uXxl71edLIwW$6rgcbK1TgdKuPetc)dGvMhcsJy*IpyUN1>pE09(YChdC1!`j68t9%`yk7cptu34%orln< zN<-L9k7h4qqw|(WeMjHgCJ!b{EdNv;?^mmeH{7T)vP~Px1U34r|4QF+f9CGKc4NA& z)0BO-)0E@OeH6PklyZwniZ|PQZO8igGtI*$3MYd|UNW9-gOwk4JK4^;S0hI@>Ncx3 zW>1cuWZioHE+=n&{Jh6tr1E#*#)VC%O}5QM(j_;#<(0cUDq4{XS>j~SWuV)zxw9En zJe1h@)!9JSklAEZ7pR4*9=F{2QDeqg#aXV4tFu!kQevue9YhyAw|a83wIxak=9t;H zhr28fs69=ib|`V^6(tp=fCT+ub?r+*C%2A4geQ#NPi9MOrPK4 zl_!HeD+&5DmebwgBR&i0fSq{N4{(4-4AyusHf+qiEegJFMmG% z1N#_5Pm#vsYwEB;JLh}+1HDSBk5iuB(p&GhXtYu@xJ5AGJWuHj-(tI6t9n~sYdXi% zbbUUU-WcAP`$uRu5iQzeb`-J0(AYTJ2$RGQ7{1}#UCdByQskvy2i=TF-4dKApKB-N z*9@g+(d`oK)*lCKAMH{sQDph6-2A?(y2u^qyBBoMuqtdH&IZW@ zp15MkX=6wJ0TjgaY0}IV_xwo|U~Ug=>&BFC_--tqlP!nNW#>UzA*y5KR{=K^@$mst zJLCl$clCsrje+tzH8m{GyD>f%c7#0^;BJI{cTwM6SXemUaQ^cN_TsmD|L0icuP<2# z@nBe3!kNl)GCDrkI~fnFO>}+tWJ&PD|7eqjun_ms002}keLgT1W;1E3#K87Ru*K!w3} z^U$=!ddgJ`|Bdb7dFi=PYt8A!xUqkRd!E`T3nMmx(n}nY`1gSS?n84;oUKRG!(Jo? 
zVffo$1E-`e$}!IHJ^9wiiYZj4Y&}Kr6~|Pof2G)$8CH;J><$0LO}y>d{c8wWDfGsE za_nH8YgvlLi1J$O!t3~@&e|s@#~Ef}TX_uykp0|dE_q7d1!{tM4X*spaw#OX>22)9 zpdJ!(YovfDrM288Y;a~jXw`qpCA^7&H%NWKx%o4e5qs713`C<f!S$Vb)8QY9V8{VX}3vAlhM)hBh8|MfU)^=@jpdkZhR&0dp`gBr5Z@! z(`L*)FCOf0Y;4jVXbnC*N#u*OjX5YtKGoNU$>_Ip6$N3+ONurQzx57Ls$`#|^s{N0 z^#+L)BNQj8fHrz<43U{IJ8!#i?#MbVA2vUs2Bsxf;chkj>r6TmYLu5(XDs%O2{ZIC z5JiBJ0$QPlytFDi5<>1R^9^~+AG-$R{JUCuKt=;#f#y-vaMPI=R0P(5yQugh7BL}5 z`7`3Mi_Bqs%T;K31TVwPQT8i%6?lT%Q+2`Psp6*-kngphv;$ZifRAV#B# zd_(tpd(G$-Ne$9YI7pNETP}C2>CnD9!z_#;Nu+1ly-dQLi)6m7?2+Zsv3HILv`F89*t(U!zV#TF0%Phom+{1;vO zXZpMLdrBYE2uv%Mx+;JCkP`95v7IwN;6eqD-@RGSXXP}mR`0o~)IaJqTT`u*1F!8! z;@j{6bu;5E6(S@YA#0$ENniTQCzloitqed9zbPm^&21?H4Bya6Xh)J*G zBq?{_V!&M7c~XijxoEp+<$MJ{9WY`B+l`6_3jHa60M*^Dx5jUy z;970T=iX#J_WgMI>?E@wfSDCk#z|(Hd-G8@Z+2EQ0O2H*_#Plmy6@^?%i}i@fWu0> z6K`@o;NUH)bkTw-iQF?Hi1%y+;p+LIBR8>=#pCvwy?k_VMj$;&jQT`5uH`=(i)YK( zOzL_iuHI-r`;onE&AMN;vtQJ1s@#`cmsxa$R&S^q@l*~Lz33~TES?#wL>6~Dt2L;1 z!Vf0&rAhzsU~PT)v(y^CDUr2-hV_YuUm9!XBu@&= zrpuHv>%Sb@^jNKSm8@+Zb~|x-Zdol!2o()&RXs!NkL5Gyb?=q*W@#AJCO#Wh^tCow ztFzU$&v%gUthkt2%Y~#Q@xCJzJk0;wYyQUHk32mr0K|MIIU6tC91G)%EAXd67mC~? z<}%ltX@-&?g=tZ><}kqGURX9No~6koGY*zxCyEfU3>AGj%*K~^QG_<y)H%&)CY;3p)bP?S&fP>bH-(>8i zif)uo4Kr+rF@n8f8%om~J?jHSzCjfD*e6LPZ9R&sse|-pk@EdnNK_TvP0+>Q@NVz? 
zMIc8>!B8Ydw`i#O87U+U<8cE;m3eXSIJ)m5Omv0}>oS z_Y`>)t}<^NENb7GmYf+si!71RSg(J*OR>$Az1<6DciD@%zzt>wKyfQjBK(q68asWTQ z?okFmmvOay=O5&je93!JDE{ejpFyL-zemr!l>Oy5mT#Hlz?aTYNbGp*eijw*o3E2v zN`N3f10BQ9#I|M`53OTWxw%HT3iHJbQ#L)Xm;HazweU77;7?#>_8fQnVAMV^Kv^S} zuVtc3zg8nh^MVnJ!4UkM3%m(W<7|!k&rv7G*8Ax$K0Ntcn*Lc;Aj^z6c|jk|gy#&) zhh`>v703U~{QvKq(7+MDq-uP-P(oY}0DFyW1(s6Gkq16Y<&faNmqk1f35nZ{38Idz zP4sQ@Y`RY$c7RaMs#S0al%^(HZt&3aP(AL5Tt4fv%Lcz=+u!JB5kt4m?i+r&+nszo zfYTukUbzW*z>+O|86H(JVXL`z7!_GBNf#(Y?ff<#I|Z4rEVa)OCn|BBK^w?w87vap z-8VS+a!8IsE?*8BEGL9Q$g;J)xPhgfW?+eCs zn9=N~uhhrKW|&lq;Y6Edp1N9;detltDDWN61m>^q|YnCsbIErGLQ2hy{Mm0(S-w7aZ?g|7_Ya~`GW7C_Or zY|z469ITkWR5xWSK!P6KF&>pOsf!z+y;2t=#uG9RBN!mo;Pg$ezmbAZo2J|5|J*wHwSivp6<^z^9UDY)0&;t7F&a95m z86ww4>i_*>Ecp%K?8vPu|2H(FA2!HztPS3YiHd>9Rn`>FN{szR?fUCzBU83LcTXu1Tu%8?cRA<$W?Uz`HJ81 z$LHAkxW`VR+M;tawG}t!zfX%e zD9QuYW0-Erxv~Gif!JO{k&1oJ(PyCCmADv@SWL{z81NIVJT-#$fP7V~e8Sr~`G4aM z^g&3^RfbC~-PIq5jtk|XXb7wm`f@KbNrhvvE(RmbuLkUr3K9-D{K0|{ z;|Hw=N>t7)G9(cm@VktTM(##y!te6be5I3vy+y&^g#aD}1Aib!5xPEtVO%53JE zX*&c1maVrREmMCz=c}6{U1y$%L_OrXl~y*3+>C=yt;85=8u$+^ApDR)0*7f(cO$7@ z#Md?sQVyA9g99Q8-yv8pSSHeFnO=Ch`s=W$| z|B_S^2g_|)jpb5%jxmLFk|E#woO^A9cyDFc_n|nWFbo!Wa*!{%rYCZ*GXv1e*T?^FRNV`oGj7ub97TfaV#EK^Ouu($jEbdHt>LlX~`@L4u~OdTR!2!>WHEQ;S!>xh~&;zW(VNvTZB{Cq?IGxjj!AotgI1 z9J87q2>h;^6*j2^rO4I3*jt~av!+LUYKsJ%VY!MLJLOyNA6QtEj}HWK)6j1tT%p%T zBf)`#R+6*fo(ngAc30y1sEa ztL-xg1{?}^iz_uIJxr3qnKUJ#rDzUtN2^uF$8-*3BF;?kk`U6tq0E{>h;}FC^SCe2c8J8CNiNT@J2r&y`A2(S z?9qu;ME;?44c_(5v1Sg;_(sd-^FmMrCK*`0Y{u?(j44M3epci0qXk0f>34Cq`5;P*I+?mA` z*x7h!b&6aCh;g+3;2ed;VAOO|bj7?+em9GV3pI!`F%H3apvikJTD5*tLOWNHF;dt| zvcT(s=)6!@E$bF;tt?`1~TQyxX+SU^z#P{h4xIo0` zc~Ub_;4qPflNb>`bvo!+cv>de9U3$Q-Sl#Ka0W8)KK&Fd;;Ao+f8@JVr7P_n_^7pr zGfPJL?Xlpm8$OYp&8%W=PV|oQ4E1wNe&Nxy*YI~!-fq0b$88>JkFa?x)6YsYl?e*E zrU8p;MZNy4xz~e03LPAWO6`Pa6c09^mx>T;K4Z{IRB2nd*B%k~@ZL{+xY?!~N=(F3 z7iKq6VgEX(P(S|K6Y!C(_~zh;N!_SUH0j*Pi-w1-` zTg$r#r6?&Ja{Um@Pr1}uA!d{pk#AaW-Vp@@OQ%u{(6-%k!7r4h2?u=yhV*83$`d{1sb`m=*{l? 
zWHS&k!xQ&3-Mc8irhkt7e8bToFkM0Nb3d|xlA)yGy2-TQ@K$tV0zQI#*TX;}MLt%l zFE!o7lS*;zB{$5-p@tjQLj*T-Q2m5fWZLh%~$u$k;r8{(iWaYzff1f z`EI)o{*1ALe%G#xW)ux%RfFJ>k~veTOUh;|i@Q-L=by7kiJtXrQl`+w7g}?5e6AMr zVwSjv?5HD@K72X*1(4y}7W1ydC4Xr54azZ_f=N%jGfME*peqBMFb#T@Lz_*5ZL*b=v^F2-wb_)n0U7)zDqd^!qMm~h;;MVZD}=Iy~qx7tMh?TuwF;JTdl(Q+CPg^VspbDbY1 zq8Jy1vjRnx&ZMa=z!y|y=SxE!3Au6N5fgd+0bk|8CH+ZW&RaeGWU4d9jeZ@bb^h&( zxbd$wMG3`$7$f6S@Ye5@$AQ-h52fp9CIaR*D?^zJsy=5$>xDj{cqF&)rwdlZFE~T% z90^xdTqx;WFs|Z7$C&Wr-~>UM#HlEX{Vv}HX(8MXkVCZoR2uB5JI*j?udxC!=!E}q z@x{uAOiKVo_qnwzG(HGXgqAdWW@ZBBI%?rsy?B9M6U*u0WX4Cg79Cr@%UOP@260P4 z?vRNa&;+@ULgaN$Bfh-gj*YHatd-{!y*v*HNO$j=u9%Wa^gYzh8tQn)`#ggM+>xgj zXcsKbpZ9ozwR_3<_g^K$F-wIh;sI?FkM76WStUA3SBvFWCZdox28OQM zHR1|<@45s&^HZEgT>!%KYIgFc$_#HbX>CO_?tM)W3@Mx$4MiZoPwX%I($>{45RjMD zqqs$7ddhx?Twx-j)m$tIeN1&EOQzomWGROa_nuepX zxQOXB%E}IKW;gNN^yEq)`V$Jp4D_XY6%7gYQ$cEykKtX;0+sf@Vx|8UZ2z^rL?5b< znB_&Y(!#>N_g0ewk}cYQ>b?$HmoLxk)NLjG3mpGfVbK5Cywz+jz7D2i4W}+$_Y!C& zU!}QMT~`A)>wH~s;DNJetFqSk;R_@%~1J9goDi8jlQF@ z2)hLQ8DY+why-4)@su9kn9L~pxOsA6aJd+LG^|YIx!Q%@v4@AU90T+ch5@54;dyjq z$YMByh^PTQ6CYLFj7SQVsD?`R&>hYll6Qs*{ax1E5l?ciQe>0D;Y>KI{{I8IWgL({ z?dQC`$YEEkHX}9x21+g+FnfYXER5p|JFINFzoBEYU!dqvbEx)$w7C?*$gt%;=rZkC2&Z0{@hc)BuD=N(@V}A-V!jo#QQAJ9Bmv;p?BoN` zMh(YGy=^ZXNn3djiXFl|y3U}mrKqS7z|i#k!mHXuh`HUzH$pMGB=nXom4@73MWDR! 
z!LfX^71!d8mWNHXMJ_mj9%vs%n%s}v`BU{a;6tT~hwpKfQbJoYDIfk_N9aD~J)EPF zbu-2CXg!^!*zpLa?c>PAett4bz#^}_bKII@qoP@45Um9|s)fpgZjN{W)Ed90F~aLQ zLG*N08GCO0rh|Sg?MM;R>2@F5z%fn8S#}@72K5eg7Pl*RLXV zf#(h2&+R1*9RZFw^tIy!B^-Yn{rAAk|Fw?nzwfV^5NxM#6DbTS)G9t?cABu055HZ+ zEa1_{)hh{L$G3-zTa%9s5{@bevsmW~nOw7nP zSf=z`$`bcdwiRblKJKKkv8gRtR+}<;{&P*^A_c;E7MA%Kz0<8;rzW1szUurjpQW|-4CSNj)bk`&-qJjfgk>3(-`pp>%1JT+eU(qz7_Y=-UGHu-^p z(SK(Cp9Q)9%3e_j|I;>*_QbzWp#8ZGFU9lx{_4PSr|T?ySzsJ?4~hh7p#9%_iG2De z_N$6v--NvQ`DXwX`my_nApw+Hd7t0{Z}I&<+9z`d5)K)kTb|KY&YKzd1`3K9!~~+2 zFnO3%Op2X+wY!d!y#Qc5AFh&pIO#LA!T`n3JZ|J`c)p}yZsVa7&K)aWgJ!;2MI`{Q zM+OsJ81EDyfAt+s?(lbjR$kB>wN|~I2O)_sYlXxGnS2JLCSkXNsHZ3?)Ep4OVmVU} zlu2B8eY>Ry*iM7X1Ht$!P??InK|QV7(oCAvBrM7u&2U}`)8s!Wn3#SLoZ>uU*9qYvg7^Q zq?=LmP85N+_>}DCZ*LPK0(gfg6vFnzFEZN`*sVDrecs9tSf;mLxpG^~d25xpoA@Ph zrxInK7`IYycZn&)oa0|YJQg)~b3qL?AfYk4+j3pc?yOA_E(sUw$6)5}d9(%0-$|BSAujV*ntrGw|53e%(Ox>~K>YMVVM$2%t9zxuNv0P-1Z zW1Fj2Z&4~*_w++>v4r0WC>@;sFav5eGIVoLgL=z6NH{ojdDl^v0kd@L=UYi@1PfuX z5D9HQ#0&R^aP4pD!!nY(W*pg z`d!1~BLzrjFD-rf>YokcW_S{o-f7|B$H3J2N_ppXsMFfoM_=I-Ct~#I7(#&UWa^BQ%Z8+hS>DO_F`wG6 z7nII@j;=N|7WT0j86iUEo4rlOJC2~=t>V9wNlQc_bZ0s~k4z-$)~g-f6b<+%xT%Aw zD&4ZPWU~<>yyPHiDQeUc;jn$1#4LB}<-AWvUyf3)?H&8K{w2O^KB%ul)ZMmgq(JGx zO9aA`bL5?U=XzaP(5b!5w40R)h3L|8pvCWQ%JSJ2+C<4KGt!^QY1KhaEZyw+RWnR$ z5s%tKzNtZQXsrEK!@+ry4_03NU|I^3HIY^ol{c>F{SfWN1!rbNq@{xv5QzHTMp8hHijsvyk zrkFI^{5hEHp2D<1UIv_kyY{gPv*sCN`zM@kFx1ejQEqNx9OLkKGSNC)HmI}SSak6F zT9D}z22Y|QecGqDEl%Ze!b-%ZU!^I%j|wJXJqwLL#Efz-^}(n1TD4i6vt%x&E-eFg zQ}I}XSD={!GuT1;$5wq;mhRbo_eYsgl%s!Mli00x4tEt;p3$o;C&i(j?Z*^7fAESq>LJoEsJh{A9a&r&0zAtM9lI$hrz@hiB% zp;qzD9nKMp{oPfldwH&W4m)WEit5oG1wlpnZ$I1V$?*j-w%6JSMJA?YUiivZz=v#N zAEQmrHJs08o)Bz@0W2UK6ry-9SZ&toau)Y=pI^-Q{DaMn%^69VAN|(N`;x~CK)n$Cp zzrUH`yqbHgj(vN7Yf$>|j^WKfAnIL6=Cz-2ABe&<_0I?|wK*y@6p2Dcw94wvJ%v3p&2EFqZWS)bR>a+?_n?=qbv& z<*i@O4&tfH<;PA>YN&H@rg4QIdu!2H<-U%;Fgl(AQ+pIvRUEc5ZIsKE2x-J759PxV zr_(x;1w~i3%2`E+#uc>%hwx8d*abjfkJEBp-{z0Ok}<|yKAT~5B;@q6Fkp~V!cMxP 
zGfa&4H1>wY@e|2dT~sWk~u9A>Lyf$QELQ#Zyn`dRx1A z2<~}7e++a@d3)_D6cQ^9q8!_KPqzQ(NI<7qDrA+*6X~!Lr}TF1u+8CEMjPUUS%*9b zq(){TYLBPNbkIyxnHR(NLx+TQgN7-%{Q13DY3qz^ac1WnNX~qTulO!<@uEq#TPF7Z za2_V_^D^P=l`F)+_Y|%tAM37ylrCIY*FJJ&5!Y1SYU^I~DcIc6N|ejU(fY-dpU?3L zhDNLMZE^qGXe+WeFl@ck;JV|bv^g{{%=NXe;l}4~1Fkng&yC0ioSe4wX2oezOZxbe z=SsDRg6j|JYZNm~(}KRlI>(-+F6yG#P=A<8c1BYdr~g^A*MK1Q>k7`v(s8r|Cqvxs z^mJIE5@`0C$l+)-Uf0&dxZ03lc{zK&TV0)X%}u8C@OUMwP{-YS;Kv)A#IVh{)s*{z zXc9up483;fiAS|S(f#}x?RO;ua2aSl(GQX!u^r0V7Bt{_K!0T{u%NItTPwx=0jTxx zcR4>V>?!hx+01jo1?*AF<0GHkr7h8&uE|7~R-sbfkec|HzAHU%N017uvBo?;t4slq zL*px0lsHvL*h8+6a5F`R&5E-M3FeCp_5#WHi~d~7xZw@BuPg72@_c&cZ* zV(}eJS{>>YSytQ|^Z5x)lv<(w2n0F>A?Ep_%E}m$38?$!=E`a5*|ywZNx{~E5G7Y+ z=0@9pJsmif)eY3MapK~=5Ng;kFa|mc;*Z{U-?nMAW%k^3uV=E+v!j%;&IuAx&6p|j zI|=ITleuf=EI3@UKdY`7%6lAGc~MHUS`5Ds^UgKCz45!_mwR!N*h@Egf~%ik_eA+w zUyiMJ5Xz)z&;*ra2D4kz=Q2PHdI?p3l%8`JUijRPy@P%tm$`Gj1^pjcwX_v$^rn6dYkG5g4O3%o?$4sv% zQVIsSkHm|H8P<4iMx;H`i-TLF<&om}*XHBip*+N(qOWjVWgBh%LvcB!yY@|PLRsv zixC#SH9%h6IsRkDkRCPj4ApY<&`$#Uqb(CJNAwvlv@F(Z^Nr%4&A;rtg7d1y8d#@= z{FtVh?hNw)7tLqT4eM=VoIqkR9#4I_yP8@)O>VMG>j0DG7FG^z_&H8j%F@+!fwHWT(NodO6od?6akXngT> z{|WQLlVIhIdxH2x?vyL*)>06`D#2L;*AiYRha(B4J5CV03&^}f{py&M*n@G h)&B!Wf(5t%MLP?>dOGEC`qw|Ul;zds%3hg={vU3VU2^~c literal 0 HcmV?d00001 diff --git a/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/kep.yaml b/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/kep.yaml new file mode 100644 index 00000000000..cfe3d863033 --- /dev/null +++ b/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/kep.yaml @@ -0,0 +1,42 @@ +title: New CPUManager Static Policy which spread hyperthreads across physical CPUs to better utilize CPU Cache +kep-number: 4176 +authors: + - "@Jeffwan" + - "@LingyanYin" + - "@horacexd" + - "@LastNight1997" +owning-sig: sig-node +participating-sigs: [] +status: provisional +creation-date: "2023-09-04" +reviewers: + - TBD +approvers: + - "@sig-node-tech-leads" +see-also: [] +replaces: [] + +# The target maturity 
stage in the current dev cycle for this KEP. +stage: alpha + +# The most recent milestone for which work toward delivery of this KEP has been +# done. This can be the current (upcoming) milestone, if it is being actively +# worked on. +latest-milestone: "v1.29" + +# The milestone at which this feature was, or is targeted to be, at each stage. +milestone: + alpha: "v1.29" + beta: "v1.30" + stable: "v1.32" + +# The following PRR answers are required at alpha release +# List the feature gate name and the components for which it must be enabled +feature-gates: + - name: "CPUManagerPolicyAlphaOptions" + components: + - kubelet +disable-supported: true + +# The following PRR answers are required at beta release +metrics: [] From df0e0f7a17edc2d27d53e4010954d0c44ca97fa1 Mon Sep 17 00:00:00 2001 From: Jiaxin Shan Date: Tue, 12 Sep 2023 09:59:23 -0700 Subject: [PATCH 02/16] Address feedbacks from reviewers --- .../README.md | 8 ++++++-- .../4176-cpumanager-spread-cpus-preferred-policy/kep.yaml | 2 +- 2 files changed, 7 insertions(+), 3 deletions(-) diff --git a/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/README.md b/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/README.md index eb7a955ac4d..ed7c00e33a3 100644 --- a/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/README.md +++ b/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/README.md @@ -174,7 +174,7 @@ updates. [documentation style guide]: https://github.com/kubernetes/community/blob/master/contributors/guide/style-guide.md --> -We propose this KEP to introduce a new CPU Manager Static Policy that spreads hyper threads across physical cores. In this policy, we changed the cpu assignment sorting algorithm to sort by socket and then directly cpus without taking physical cores into ordering. It seems like a noisy neighbour issue, but it's always true. We will explain the reason in the motivation section. 
This policy is useful for some applications which need to take advantage of CPU Cache. +This KEP proposes a new CPU Manager Static Policy Option that spreads hyper threads across physical cores. With this option, the CPU assignment sorting algorithm sorts by socket and then directly by CPU, without taking physical cores into account. This may look like it invites the noisy neighbour issue, but that is not always the case; the Motivation section explains why. The option is useful for applications which need to take advantage of the CPU cache. ## Motivation @@ -431,6 +431,11 @@ in back-to-back releases. - Deprecate the flag --> +#### Alpha + +- Feature implemented behind the existing static policy feature flag +- Initial e2e tests completed and enabled + ### Upgrade / Downgrade Strategy -We only target alpha for this release. ###### How can a rollout or rollback fail? Can it impact already running workloads? diff --git a/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/kep.yaml b/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/kep.yaml index cfe3d863033..443988b9446 100644 --- a/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/kep.yaml +++ b/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/kep.yaml @@ -10,7 +10,7 @@ participating-sigs: [] status: provisional creation-date: "2023-09-04" reviewers: - - TBD + - "@ffromani" approvers: - "@sig-node-tech-leads" see-also: [] From 981887390d3963ba60296d5c1612f22b78712394 Mon Sep 17 00:00:00 2001 From: Jiaxin Shan Date: Thu, 5 Oct 2023 01:46:26 -0700 Subject: [PATCH 03/16] Address reviewers 2nd review feedbacks --- .../README.md | 106 ++++-------------- 1 file changed, 22 insertions(+), 84 deletions(-) diff --git a/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/README.md b/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/README.md index ed7c00e33a3..a67a49bebff 100644 --- a/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/README.md +++ 
b/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/README.md @@ -96,6 +96,7 @@ tags, and then generate with `hack/update-toc.sh`. - [Integration tests](#integration-tests) - [e2e tests](#e2e-tests) - [Graduation Criteria](#graduation-criteria) + - [Alpha](#alpha) - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) - [Version Skew Strategy](#version-skew-strategy) - [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) @@ -332,7 +333,9 @@ This can inform certain test coverage improvements that we want to do before extending the production code to implement this enhancement. --> -- ``: `` - `` +- `k8s.io/kubernetes/pkg/kubelet/cm/cpumanager`: `20231005` - `86.3%` + +new added codes would be a cpu allocation policy option. We can follow how other options are tested and add enough unit tests. ##### Integration tests @@ -351,7 +354,7 @@ For Beta and GA, add links to added tests together with links to k8s-triage for https://storage.googleapis.com/k8s-triage/index.html --> -- : +No new integration tests for kubelet are planned. ##### e2e tests @@ -365,76 +368,16 @@ https://storage.googleapis.com/k8s-triage/index.html We expect no non-infra related flakes in the last month as a GA graduation criteria. --> -- : +No new e2e tests for kubelet are planned. ### Graduation Criteria - - #### Alpha - Feature implemented behind the existing static policy feature flag -- Initial e2e tests completed and enabled +- The functionality of the new CPU allocation algorithm is implemented +- Initial unit tests are completed and coverage is improved +- Documentation is improved and gives potential users enough guidance and examples. ### Upgrade / Downgrade Strategy @@ -647,8 +590,8 @@ Recall that end users cannot usually observe component logs or access metrics. 
- [ ] API .status - Condition name: - Other field: -- [ ] Other (treat as last resort) - - Details: +- [x] Other (treat as last resort) + - Details: Provide logical cpu allocation distribution across physical cores and also the cpu cache metrics from ecosystem. ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? @@ -831,19 +774,10 @@ details). For now, we leave it here. N/A ###### What are other known failure modes? -N/A - +The failure modes is similar to existing options. It changes the way how cpu manager allocate CPUs. +It's compatible when user switch between options, however, when the pod get rescheduled, it will follow the current static option instead of previous one. + +When user switch to non static mode, then `/var/lib/kubelet/cpu_manager_state` requires deletion. This is a known compatibility issue. ###### What steps should be taken if SLOs are not being met to determine the problem? @@ -862,9 +796,13 @@ Major milestones might include: ## Drawbacks - +Let's talk about the limitation of current policies. + +1. In a cluster with sparse workloads, we try to leverage as much cpu cache as we can. `full-pcpus-only` will always allocate full phsical cores and it introduces cache competition between vcpus. + +2. `distribute-cpus-across-num` will evenly distribut CPU across NUMA nodes. In some cases, we want the application to be allocated in single NUMA node if possible, which gives better performance. + +Existing solutions can not address all the special needs from high peformance applications, that's why a new option is needed. 
## Alternatives From 9072dfe74f1898d475cfa4c45adc119ca4e67d6e Mon Sep 17 00:00:00 2001 From: Jiaxin Shan Date: Thu, 5 Oct 2023 10:55:53 -0700 Subject: [PATCH 04/16] Address review feedback --- .../README.md | 25 +++++++++++-------- 1 file changed, 14 insertions(+), 11 deletions(-) diff --git a/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/README.md b/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/README.md index a67a49bebff..7de14bc905e 100644 --- a/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/README.md +++ b/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/README.md @@ -335,8 +335,6 @@ extending the production code to implement this enhancement. - `k8s.io/kubernetes/pkg/kubelet/cm/cpumanager`: `20231005` - `86.3%` -new added codes would be a cpu allocation policy option. We can follow how other options are tested and add enough unit tests. - ##### Integration tests -No new e2e tests for kubelet are planned. +These cases will be added to the existing `e2e_node` tests: + - CPU Manager works with `spread-physical-cpus-preferred` static policy option + +- Basic functionality +1. Enable `CPUManagerPolicyAlphaOptions` and configure the CPUManager policy option `spread-physical-cpus-preferred`. +2. Verify the machine has more than one physical core. +3. Create a simple pod with a container that requests 2 CPUs. +4. Verify that the container's CPUs are allocated across different physical cores. +5. Delete the pod. ### Graduation Criteria @@ -591,7 +597,7 @@ Recall that end users cannot usually observe component logs or access metrics. - Condition name: - Other field: - [x] Other (treat as last resort) - - Details: Provide logical cpu allocation distribution across physical cores and also the cpu cache metrics from ecosystem. + - Details: Inspect the kubelet configuration of the nodes: check the feature gate and usage of the new option. ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
@@ -774,9 +780,12 @@ details). For now, we leave it here. N/A ###### What are other known failure modes? + The failure modes is similar to existing options. It changes the way how cpu manager allocate CPUs. It's compatible when user switch between options, however, when the pod get rescheduled, it will follow the current static option instead of previous one. +Currently, in the alpha version, we treat this option as incompatible with other options. Users should stick to this option. Compatibility issues will be resolved in a future version. + When user switch to non static mode, then `/var/lib/kubelet/cpu_manager_state` requires deletion. This is a known compatibility issue. ###### What steps should be taken if SLOs are not being met to determine the problem? @@ -796,13 +805,7 @@ Major milestones might include: ## Drawbacks -Let's talk about the limitation of current policies. - -1. In a cluster with sparse workloads, we try to leverage as much cpu cache as we can. `full-pcpus-only` will always allocate full phsical cores and it introduces cache competition between vcpus. - -2. `distribute-cpus-across-num` will evenly distribut CPU across NUMA nodes. In some cases, we want the application to be allocated in single NUMA node if possible, which gives better performance. - -Existing solutions can not address all the special needs from high peformance applications, that's why a new option is needed. +This allocation strategy tries to avoid a workload taking an entire physical core, so it is not suitable for all workloads. For example, if the workload is CPU intensive and not sensitive to CPU cache, this policy is not a good fit; such an application may suffer a performance regression.
## Alternatives From 8d1542278103081666e975ba35bf6c8f5bf738ef Mon Sep 17 00:00:00 2001 From: Jiaxin Shan Date: Thu, 8 Feb 2024 23:28:21 +0800 Subject: [PATCH 05/16] Address reviewer comments for v1.30 --- keps/prod-readiness/sig-node/4176.yaml | 6 ++++++ .../4176-cpumanager-spread-cpus-preferred-policy/README.md | 4 ++-- .../4176-cpumanager-spread-cpus-preferred-policy/kep.yaml | 6 +++--- 3 files changed, 11 insertions(+), 5 deletions(-) create mode 100644 keps/prod-readiness/sig-node/4176.yaml diff --git a/keps/prod-readiness/sig-node/4176.yaml b/keps/prod-readiness/sig-node/4176.yaml new file mode 100644 index 00000000000..29fe23b3fbb --- /dev/null +++ b/keps/prod-readiness/sig-node/4176.yaml @@ -0,0 +1,6 @@ +# The KEP must have an approver from the +# "prod-readiness-approvers" group +# of http://git.k8s.io/enhancements/OWNERS_ALIASES +kep-number: 4176 +alpha: + approver: "@jpbetz" diff --git a/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/README.md b/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/README.md index 7de14bc905e..d4972b23c27 100644 --- a/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/README.md +++ b/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/README.md @@ -501,7 +501,7 @@ If we reactivate the feature after a rollback, the outcome remains unchanged. Cu ###### Are there any tests for feature enablement/disablement? -A dedicated e2e test will validate the preservation of the default behavior when the feature gate is turned off or when the feature is unused. This will be conducted through two distinct test scenarios. +A dedicated e2e test will validate the preservation of the default behavior when the feature gate is turned off, when the feature is unused, and when `spread-physical-cpus-preferred` is set but `CPUManagerPolicyAlphaOptions` is disabled. This will be conducted through three distinct test scenarios. -This applies to any machines with hyper-threading enabled. +No.
It doesn't rely on other Kubernetes components. ### Scalability diff --git a/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/kep.yaml b/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/kep.yaml index 443988b9446..34b72dfd9ca 100644 --- a/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/kep.yaml +++ b/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/kep.yaml @@ -26,9 +26,9 @@ latest-milestone: "v1.29" # The milestone at which this feature was, or is targeted to be, at each stage. milestone: - alpha: "v1.29" - beta: "v1.30" - stable: "v1.32" + alpha: "v1.30" + beta: "v1.31" + stable: "v1.33" # The following PRR answers are required at alpha release # List the feature gate name and the components for which it must be enabled From 077567c8c808099f8d5ad211925991d5454c822b Mon Sep 17 00:00:00 2001 From: Jiaxin Shan Date: Thu, 8 Feb 2024 23:30:57 +0800 Subject: [PATCH 06/16] Update the lastest milestone to v1.30 --- .../4176-cpumanager-spread-cpus-preferred-policy/kep.yaml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/kep.yaml b/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/kep.yaml index 34b72dfd9ca..53a75b32cc9 100644 --- a/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/kep.yaml +++ b/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/kep.yaml @@ -22,7 +22,7 @@ stage: alpha # The most recent milestone for which work toward delivery of this KEP has been # done. This can be the current (upcoming) milestone, if it is being actively # worked on. -latest-milestone: "v1.29" +latest-milestone: "v1.30" # The milestone at which this feature was, or is targeted to be, at each stage. 
milestone: From f3a655d21aeb9c81d41555c39b08792c5d726047 Mon Sep 17 00:00:00 2001 From: Jiaxin Shan Date: Fri, 9 Feb 2024 19:48:49 +0800 Subject: [PATCH 07/16] Update keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/README.md Co-authored-by: Kevin Klues --- .../4176-cpumanager-spread-cpus-preferred-policy/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/README.md b/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/README.md index d4972b23c27..1bb0cc66189 100644 --- a/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/README.md +++ b/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/README.md @@ -175,7 +175,7 @@ updates. [documentation style guide]: https://github.com/kubernetes/community/blob/master/contributors/guide/style-guide.md --> -We propose this KEP to introduce a new CPU Manager Static Policy Option that spreads hyper threads across physical cores. In this policy, we changed the cpu assignment sorting algorithm to sort by socket and then directly cpus without taking physical cores into ordering. It seems like a noisy neighbour issue, but it's always true. We will explain the reason in the motivation section. This policy is useful for some applications which need to take advantage of CPU Cache. +In this KEP, we propose a new CPU Manager Static Policy Option called `distribute-cores-across-cpus` to prefer allocating cores from different physical CPUs on the same socket. This new policy is analogous to the `distribute-cpus-across-numa` policy option in that it proposes to *spread* cores allocations out, rather than pack them together. The main difference being that this new policy spreads individual core allocations across CPUs, whereas the existing policy spreads them across NUMA nodes. 
Such a policy is useful, for example, if an application wants to avoid being a noisy neighbor with itself, but still take advantage of the L2 cache by running its application threads on the same socket. ## Motivation From 564061da14d750aaedd3f59039442573826b9eec Mon Sep 17 00:00:00 2001 From: Jiaxin Shan Date: Fri, 9 Feb 2024 19:49:00 +0800 Subject: [PATCH 08/16] Update keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/README.md Co-authored-by: Kevin Klues --- .../4176-cpumanager-spread-cpus-preferred-policy/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/README.md b/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/README.md index 1bb0cc66189..bdc7b182db7 100644 --- a/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/README.md +++ b/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/README.md @@ -59,7 +59,7 @@ should be approved by the remaining approvers and/or the owning SIG (or SIG Architecture for cross-cutting KEPs). --> -# KEP-4176: A new Static Policy to spread Hyperthreads across physical CPUs to better utilize CPU Cache +# KEP-4176: A new static policy to prefer allocating cores from different CPUs on the same socket -- Introduce a new CPU Manager Static Policy that spreads hyper threads across physical cores without considering NUMA. +- Introduce a new CPU Manager Static Policy that spreads CPUs across physical cores without considering NUMA. - Enhance application performance by taking advantage of L2 Cache. 
## Non-Goals From 8aff41fad155270c215d252f1a8041620bcdede8 Mon Sep 17 00:00:00 2001 From: Jiaxin Shan Date: Fri, 9 Feb 2024 23:34:53 +0800 Subject: [PATCH 11/16] Update keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/README.md Co-authored-by: Kevin Klues --- .../4176-cpumanager-spread-cpus-preferred-policy/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/README.md b/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/README.md index a5caefeeaf3..c942922487b 100644 --- a/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/README.md +++ b/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/README.md @@ -207,7 +207,7 @@ What is out of scope for this KEP? Listing non-goals helps to focus discussion and make progress. --> -- This proposal does not aim to modify the existing CPU Manager Core Binding Policies. It focuses solely on introducing a new policy for spreading hyper threads across physical cores. +- This proposal does not aim to modify the existing CPU Manager Core Binding Policies. It focuses solely on introducing a new policy for spreading CPUs across physical cores. - It does not address other resource allocation or management aspects within Kubernetes. 
From b44a5fdcb44d054dacd70533f85f3d496bbf8f3e Mon Sep 17 00:00:00 2001 From: Jiaxin Shan Date: Fri, 9 Feb 2024 23:35:08 +0800 Subject: [PATCH 12/16] Update keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/README.md Co-authored-by: Kevin Klues --- .../README.md | 16 +--------------- 1 file changed, 1 insertion(+), 15 deletions(-) diff --git a/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/README.md b/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/README.md index c942922487b..13fd40a2265 100644 --- a/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/README.md +++ b/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/README.md @@ -177,21 +177,7 @@ updates. In this KEP, we propose a new CPU Manager Static Policy Option called `distribute-cores-across-cpus` to prefer allocating cores from different physical CPUs on the same socket. This new policy is analogous to the `distribute-cpus-across-numa` policy option in that it proposes to *spread* cores allocations out, rather than pack them together. The main difference being that this new policy spreads individual core allocations across CPUs, whereas the existing policy spreads them across NUMA nodes. Such a policy is useful, for example, if an application wants to avoid being a noisy neighbor with itself, but still take advantage of the L2 cache by running its application threads on the same socket. -## Motivation - - - -`full-pcpus-only` was introduced to resolve the problem that different containers sharing the same physical cores which leads noisy neighbor issue. `distribute-cpus-across-numa` distributed cpus across NUMA to make sure no single worker suffers from NUMA effects more than any other, improving the overall performance of these types of applications. These two and default behavior can not meet the requirement of some applications which need to take advantage of L2 Cache. 
In that case, we need to spread hyper threads across physical cores, while NUMA is not a important factor. In such cases, the assumption is that the single physical core with two applications won't be always busy. So we can take advantage of CPU Cache to improve the performance of these applications. Otherwise, it still have the noisy neighbor issue. - - -### Goals +## Goals -In this KEP, we propose a new CPU Manager Static Policy Option called `distribute-cores-across-cpus` to prefer allocating cores from different physical CPUs on the same socket. This new policy is analogous to the `distribute-cpus-across-numa` policy option in that it proposes to *spread* cores allocations out, rather than pack them together. The main difference being that this new policy spreads individual core allocations across CPUs, whereas the existing policy spreads them across NUMA nodes. Such a policy is useful, for example, if an application wants to avoid being a noisy neighbor with itself, but still take advantage of the L2 cache by running its application threads on the same socket. +In this KEP, we propose a new CPU Manager Static Policy Option called `distribute-cpus-across-cores` to prefer allocating CPUs from different physical cores on the same socket. This new policy is analogous to the `distribute-cpus-across-numa` policy option in that it proposes to *spread* CPU allocations out, rather than pack them together. The main difference being that this new policy spreads individual CPU allocations across cores, whereas the existing policy spreads them across NUMA nodes. Such a policy is useful, for example, if an application wants to avoid being a noisy neighbor with itself, but still take advantage of the L2 cache by running its threads on the same socket. 
## Goals From 4d7f4f501925875ba390c8a47c7321765ff77f8d Mon Sep 17 00:00:00 2001 From: Jiaxin Shan Date: Fri, 9 Feb 2024 23:35:29 +0800 Subject: [PATCH 14/16] Update keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/README.md Co-authored-by: Kevin Klues --- .../4176-cpumanager-spread-cpus-preferred-policy/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/README.md b/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/README.md index fbdc2273840..17e0a361758 100644 --- a/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/README.md +++ b/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/README.md @@ -221,7 +221,7 @@ bogged down. #### Story 1 Bytedance Database Performance Optimization -We're running DB instances in Kubernetes and adopt default static policy in the past. While, we notice that the performance of DB instances is not stable. If an instance is under pressure, in original way, it was allocated two cpus from same physical core. However, an important pattern we notice is not always all instances are busy. After exploration, we find that the CPU cache is one bottleneck, once we spread hyper threads across physical cores, the busy instance can leverage more CPU cache and performance is improved a lot. +We run DB instances in Kubernetes and previously used the default static policy. We noticed that the performance of DB instances was not stable. Under the original policy, an instance under pressure was allocated two CPUs from the same physical core. However, an important pattern we noticed is that not all instances are busy at the same time. After investigation, we found that the CPU cache was one bottleneck: once we allocate CPUs across physical cores, the busy instance can leverage more CPU cache and performance improves significantly.
#### Story 2 From 41739e90d60df52acd1f35ff73689485a359bcad Mon Sep 17 00:00:00 2001 From: Jiaxin Shan Date: Fri, 9 Feb 2024 23:35:50 +0800 Subject: [PATCH 15/16] Update keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/README.md Co-authored-by: Kevin Klues --- .../4176-cpumanager-spread-cpus-preferred-policy/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/README.md b/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/README.md index 17e0a361758..bef1e6fe2d7 100644 --- a/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/README.md +++ b/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/README.md @@ -208,7 +208,7 @@ The "Design Details" section below is for the real nitty-gritty. --> -We propose to add a new `CPUManager` policy option called `spread-physical-cpus-preferred` to the static CPUManager policy. When enabled, this will trigger the CPUManager to try to allocate CPUs across physical nodes as much as possible. +We propose to add a new `CPUManager` policy option called `distribute-cpus-across-cores` to the static CPUManager policy. When enabled, this will trigger the CPUManager to try to allocate CPUs across physical cores as much as possible. It will not prohibit a CPU from being allocated on a core that already has a CPU allocated, but it will only resort to doing so once there is no other option. 
### User Stories (Optional) From 5a987826acefde8e52ff3ea6e97ca7de4eaedf98 Mon Sep 17 00:00:00 2001 From: Jiaxin Shan Date: Fri, 9 Feb 2024 23:44:04 +0800 Subject: [PATCH 16/16] Update kep status and toc --- .../4176-cpumanager-spread-cpus-preferred-policy/README.md | 5 ++--- .../4176-cpumanager-spread-cpus-preferred-policy/kep.yaml | 2 +- 2 files changed, 3 insertions(+), 4 deletions(-) diff --git a/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/README.md b/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/README.md index bef1e6fe2d7..fb115af290c 100644 --- a/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/README.md +++ b/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/README.md @@ -80,9 +80,8 @@ tags, and then generate with `hack/update-toc.sh`. - [Release Signoff Checklist](#release-signoff-checklist) - [Summary](#summary) -- [Motivation](#motivation) - - [Goals](#goals) - - [Non-Goals](#non-goals) +- [Goals](#goals) +- [Non-Goals](#non-goals) - [Proposal](#proposal) - [User Stories (Optional)](#user-stories-optional) - [Story 1 Bytedance Database Performance Optimization](#story-1-bytedance-database-performance-optimization) diff --git a/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/kep.yaml b/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/kep.yaml index 53a75b32cc9..15a70b70745 100644 --- a/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/kep.yaml +++ b/keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/kep.yaml @@ -7,7 +7,7 @@ authors: - "@LastNight1997" owning-sig: sig-node participating-sigs: [] -status: provisional +status: implementable creation-date: "2023-09-04" reviewers: - "@ffromani"
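
For reviewers following the series, the preference order that the Proposal describes for `distribute-cpus-across-cores` (spread CPUs over physical cores, and only share a core once there is no other option) can be sketched as follows. This is a minimal illustration in Python under stated assumptions, not the kubelet implementation: the function names `distribute_cpus_across_cores` and `cores_used`, the `topology` mapping from logical CPU ID to physical core ID, and the single-socket simplification are all hypothetical and introduced only for this example.

```python
from collections import defaultdict


def distribute_cpus_across_cores(topology, available, num_cpus):
    """Pick num_cpus logical CPUs, preferring the least-occupied physical cores.

    topology:  dict mapping logical CPU ID -> physical core ID (hypothetical shape).
    available: set of logical CPU IDs that are currently free.
    Assumes a single socket; socket and NUMA ordering are ignored in this sketch.
    """
    if num_cpus > len(available):
        raise ValueError("not enough available CPUs")

    # Count the CPUs on each core that are already taken by other containers.
    used_per_core = defaultdict(int)
    for cpu, core in topology.items():
        if cpu not in available:
            used_per_core[core] += 1

    free = set(available)
    chosen = []
    for _ in range(num_cpus):
        # Prefer the core with the fewest occupied CPUs; break ties by CPU ID.
        cpu = min(free, key=lambda c: (used_per_core[topology[c]], c))
        chosen.append(cpu)
        free.remove(cpu)
        # A core that now hosts this CPU becomes less attractive for the next pick.
        used_per_core[topology[cpu]] += 1
    return chosen


def cores_used(topology, cpus):
    """Return how many distinct physical cores the given CPUs occupy."""
    return len({topology[c] for c in cpus})


if __name__ == "__main__":
    # Hypothetical machine: 4 physical cores, 2 hardware threads each,
    # with sibling threads numbered (0, 4), (1, 5), (2, 6), (3, 7).
    example_topology = {cpu: cpu % 4 for cpu in range(8)}
    picked = distribute_cpus_across_cores(example_topology, set(range(8)), 2)
    print(picked, cores_used(example_topology, picked))
```

On the hypothetical 4-core, 8-thread topology above, a request for 2 CPUs lands on two different cores, and a fifth CPU is placed on an already-used core only after every core holds one allocation. The `cores_used` helper mirrors the kind of check the e2e steps describe, namely that a container's CPUs span more than one physical core.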