KEP-1287: InPlacePodVerticalScaling BETA update #4704
Conversation
- Impact of its outage on the feature:
- Impact of its degraded performance or high-error rates on the feature:

Compatible container runtime (see [CRI changes](#cri-changes)).
Perhaps also the addition of a RuntimeHandlerFeature for resize? As commented here.
That's more for the underlying OCI runtimes to report features they have available. We could maybe add features to the runtime status object, though I think the runtimes have been compatible for a while, unless the design is changing.
There's now RuntimeFeature, which we could use if we wanted to.
On further discussion, we've decided to push this back to v1.32. There are currently too many unresolved edge cases. I'll update this PR soon to highlight those unresolved conditions.
/milestone v1.32
Still WIP. I'm investigating, and plan to update / add sections on:

/assign @thockin

This is ready for review. I'll bring up some of these points for discussion in the next SIG-Node meeting.
* **Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?** Yes

- The feature should not be disabled on a running node (create a new node instead).
PRR reviewer here: I think I'm following the discussion but want to be sure... Can this be made to work without creating new nodes? If so I think we're all set from a PRR perspective. I don't think we can answer this with "no" and consider the feature production ready.
One minor change request, then LGTM for PRR.
/approve
@@ -51,6 +63,7 @@
- [Implementation History](#implementation-history)
One thing we should probably add to the PRR or elsewhere is the possible need to adjust API Priority and Fairness (APF) settings (from recent experience with misconfigured priority in prod 😮‍💨).
Why? I don't think IPPVS increases API calls, does it? Why is APF specifically relevant here?
I guess there are the resize calls themselves; is that the case you're thinking of?
Yeah.. I don't expect this to be a big issue. But if the kubelet is lumped together with other node-infra daemonsets in the same FlowSchema, then its impact would be amplified and hard to trace. It's more of a CYA move 😉
@@ -632,7 +653,7 @@ Pod Status in response to user changing the desired resources in Pod Spec.
  for a Node with 'static' CPU Manager policy, that resize is rejected, and
  an error message is logged to the event stream.
* To avoid races and possible gamification, all components will use Pod's
-  Status.ContainerStatuses[i].AllocatedResources when computing resources used
+  Status.ContainerStatuses[i].Resources when computing resources used
This could be a problem if the runtime does not support reporting status.Resources (e.g. users on containerd < 1.6.9), or if the runtime is reporting wrong resources.
If the runtime doesn't report the resources, we fall back to the allocated resources (e.g. memory requests). If the runtime reports the wrong resources, I don't think there's anything we can do about it, but I'm not sure we need to design for that?
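For illustration, here is a minimal sketch in Go of the fallback behavior described here, using the public `k8s.io/api/core/v1` types; the helper function and its `allocated` parameter are assumptions for this example, not the actual kubelet code:

```go
package sketch

import (
	v1 "k8s.io/api/core/v1"
)

// resourcesInUse is a hypothetical helper: it returns the requests a component
// should account for a container, preferring the runtime-reported
// status.Resources and falling back to the allocated requests when the runtime
// (e.g. containerd < 1.6.9) does not report them.
func resourcesInUse(cs v1.ContainerStatus, allocated v1.ResourceList) v1.ResourceList {
	if cs.Resources != nil && cs.Resources.Requests != nil {
		// The runtime reported actual resources; use them.
		return cs.Resources.Requests
	}
	// Fall back to the kubelet's allocated (admitted) requests.
	return allocated
}
```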
We have a fallback solution here, and agreed with @tallclair on not over-engineering for the case where the runtime reports the wrong resources. All we can do is perform some basic sanity checks on the reported values to see if they are within the allowable range based on node capacity and pod requests. I am OK with doing nothing for beta, at least. WDYT? @vinaykul @tallclair
This might be ok ... if you remember, we were initially setting status...Resources from AllocatedResources upon successful UpdateContainerResources,
before Lantao found a race in the implementation. We could continue to do that by tracking the last successful UpdateContainerResources values, as it brings us closer to the last known good configured values.
However, my concern is that by not reporting true values we might hide potential problems, throwing off o11y tools and masking alerts that should be firing off pagers. I don't know enough; let me check with our o11y folks and get back. Also, it is weird that a container which is not running says that it has configured resource requests & limits.
Good point! Let's ensure we do some basic sanity checks on the reported values and log something if a value is out of range.
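As a rough illustration of the kind of check being suggested (a sketch only; the function, its parameters, and the chosen bounds are assumptions, not an agreed implementation):

```go
package sketch

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	"k8s.io/klog/v2"
)

// sanityCheckReported logs a warning if a runtime-reported value falls outside
// a plausible range derived from the container's allocated request and the
// node's allocatable capacity. Purely illustrative; real bounds would need
// more care (e.g. values may legitimately lag during an in-progress resize).
func sanityCheckReported(container string, name v1.ResourceName, reported, allocated, nodeAllocatable resource.Quantity) {
	if reported.Cmp(allocated) < 0 || reported.Cmp(nodeAllocatable) > 0 {
		klog.Warningf("container %q: runtime-reported %s %s is outside the expected range [%s, %s]",
			container, name, reported.String(), allocated.String(), nodeAllocatable.String())
	}
}
```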
I had a good discussion with our o11y lead @vjsamuel a short while ago and gave him a quick overview of this feature and the proposed changes. He has some interesting insights on how we currently use Requests and how the suggested changes can affect handling in o11y, and can explain them much better than I can.. I know just enough about this to be dangerous. Thanks @vjsamuel for your help!
* Multiple containers

These resource requests & limits can have interdependencies that Kubernetes may not be aware of. For example, two containers coordinating work may need to be scaled in tandem. It probably doesn't make
Is there a clear, documented use case for "gang resize" or is this speculation?
I know I leveraged the notion of the atomic NodeResourcesFit predicate when computing resize fit to make the case for pod-level ResizeStatus vs. container-level ResizeStatus with @thockin. But we may be making things worse by building it into the design.
Consider this:
- We have a node (4 CPUs, 4Gi memory) with two pods: P1(req: {cpu: 2000m, mem: 3000Mi}) and P2(req: {cpu: 1500m, mem: 1000Mi}).
- VPA triggers an in-place pod resize for P1, updating it to P1(req: {cpu: 3000m, mem: 1000Mi}). But due to the atomic resize notion, this request remains in the Deferred state, keeping P1 at its original resource allocation levels for all types of resources.
- Some time later, a new pod P3(req: {mem: 2000Mi}) comes along. This pod P3 could have run on the node if the above resize request had been deconstructed into P1(req: {cpu: 2000m, mem: 1000Mi}) now plus P1(req: {cpu: 3000m, mem: 1000Mi}) later. P1's CPU increase is still Deferred, but P1 is now closer to its desired state.

In the above scenario, P3 either remains pending or CA fires up another node just to run P3 - either case increases cost to the customer due to resource waste via over-provisioning.
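To make the arithmetic in this scenario concrete, here is a back-of-the-envelope check of the numbers above (illustration only, not code from the KEP):

```go
package main

import "fmt"

func main() {
	const nodeCPU, nodeMem = 4000, 4096 // node allocatable: 4 CPUs (millicores), 4Gi (Mi)

	// Atomic resize: P1's request is Deferred, so P1 stays at 2000m/3000Mi; P2 is 1500m/1000Mi.
	fmt.Printf("atomic:  cpu %dm/%dm, mem %dMi/%dMi, free mem %dMi (< P3's 2000Mi, so P3 stays pending)\n",
		2000+1500, nodeCPU, 3000+1000, nodeMem, nodeMem-(3000+1000))

	// Deconstructed resize: P1's memory decrease (3000Mi -> 1000Mi) is admitted,
	// while the CPU increase (2000m -> 3000m) remains Deferred.
	fmt.Printf("partial: cpu %dm/%dm, mem %dMi/%dMi, free mem %dMi (P3's 2000Mi fits)\n",
		2000+1500, nodeCPU, 1000+1000, nodeMem, nodeMem-(1000+1000))
}
```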
I added clarification that this is about containers in the same pod, not resizing multiple pods together. There probably is a "gang resize" use case that goes along with gang scheduling, but that's out of scope for now.
WRT the scenario you listed here, I think scaling up CPU and scaling down memory at the same time is going to be extremely unusual. I would expect that if CPU is being scaled up, memory will be scaled up or stay the same. Maybe with VPA, if it's constantly scaling to match usage, this could happen? I'm not sure it's a use case we should optimize for, though.
In this case, I am also talking about resizing multiple containers in a single pod with something like the tandem resize constraint (to clarify my “gang resize” terminology above - and yes, let’s absolutely leave resizing multiple pods all-or-none ("syndicate resize"? 😓 ) out of scope .. non-goal). Also, this is not some corner-case thing; it is common for app owners to over-provision their requests, e.g. ask for way too much memory when the app is CPU-intensive. (We have an infra-team resource partly tasked with tuning these and negotiating with app teams to lower our DC hardware costs and power footprint.) Another (contrived) example is a pod’s memory request & limit increase being withheld despite plenty of memory being available, just because the CPU-increase part of the same resize request is Deferred (leading to OOM-kills).
If those resizes were to come in as multiple discrete requests and we can handle them with the same outcome, then we should be able to handle them coming in as a single request. A complex resize request is eventually broken down into simple UpdateContainerResources CRI calls to actuate it. I think the same can be done at admit, to get the pod as close to the desired state as possible. I feel the atomic-resize notion at admit can become a footgun that leaves clusters over-provisioned… which kinda defeats the purpose of this project.
Thinking out loud, what if we deconstructed a complex resize and admitted what we can (decrease requests will always be admitted), and set status to Deferred or Infeasible if any request cannot be admitted? I see now why @thockin's per-container ResizeStatus makes sense (at least internally, while keeping pod-level ResizeStatus for VPA to consume). Pod-level resize status will be Infeasible if one or more container requests are Infeasible, and Deferred if one or more requests are Deferred but all are feasible.
I need to think through this very carefully, but it’s been a long day and I’m beat rn… however, modeling it this way feels like it will work. WDYT?
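For what it's worth, a small sketch of the aggregation rule proposed in that comment (the type and names below are made up for illustration and are not the KEP's API): pod-level status is Infeasible if any container's resize is Infeasible, Deferred if any is Deferred and none are Infeasible, otherwise InProgress.

```go
package sketch

// ResizeStatus is a hypothetical per-container resize status for this example.
type ResizeStatus string

const (
	ResizeInProgress ResizeStatus = "InProgress"
	ResizeDeferred   ResizeStatus = "Deferred"
	ResizeInfeasible ResizeStatus = "Infeasible"
)

// aggregateResizeStatus derives a pod-level status from per-container statuses.
func aggregateResizeStatus(perContainer []ResizeStatus) ResizeStatus {
	agg := ResizeInProgress
	for _, s := range perContainer {
		switch s {
		case ResizeInfeasible:
			return ResizeInfeasible // any Infeasible container makes the pod resize Infeasible
		case ResizeDeferred:
			agg = ResizeDeferred // Deferred wins over InProgress, but not over Infeasible
		}
	}
	return agg
}
```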
I'm skeptical that trying to do partial resize is useful or correct, and it seems more likely to fail the "least surprise" test. Kube is level-actuated, kubelet should be driving at the most recent full update, I think.
IIUC, the level-actuated principle would apply if two resizes R1 (cpu: 1->2) and R2 (cpu: 2->3) both arrive before Kubelet has a chance to admit R1. In that case, we do R (cpu: 1->3), if it can be admitted. And all-or-none at pod atomicity makes sense when scheduling and admitting a new pod at Kubelet.
In this case, we have R1 (mem: 1->2), which is admissible, and R2 (cpu: 1->2), which is Deferred. If we get [R1, R2], we would admit R1 and defer R2. But if it came in as [R1+R2] or [R2, R1], then we would head-of-line block R1. The scheduler earmarks resources for both R1 & R2 anyway via max(Requests, AllocatedResources). So why withhold R1 by introducing artificial barriers that are not really necessary and waste those resources?
By admitting what we can when those resources are already reserved, we would bring the pod's actual state closer to its desired state than with pod-level atomic resize. We wouldn't hold up a container image update because a Deferred resize prevents a full update, would we?
+1 on @thockin's points above. I'd go with simplicity for now. Getting the in-place pod resizing feature to beta is a significant milestone for us. Focusing on the core functionality and addressing the most common use cases with the atomic approach first allows for a faster beta release. Once the feature is released as beta, we can collect feedback from users and evolve the feature toward partial resizes based on that feedback. @vinaykul WDYT?
I'm not in favor of this; it defeats the purpose of the project. Make it a GA blocker?
> defeats the purpose of the project
That's a huge statement. What we have here is a relatively minor case (and if not, something unforeseen is happening). IMO, having a clearly defined and safe semantic for such a corner case is more important than being exactly right in every case.
Thanks Vinay!
- Let's clearly document the limitations of the atomic approach, indicating that partial resizes are not yet supported and noting the potential for resource waste in certain scenarios. The plan is to call out and collect input from users of the upcoming beta feature.
- Let's also include re-evaluating this for GA. Based on the data and feedback we have, we can re-evaluate the need for partial resizes before GA. If the data shows significant resource waste or user frustration, prioritize implementing partial resizes as a prerequisite for GA.
+1 to Dawn's suggestions. I'll update the KEP to include this.
I think we also need to reevaluate this in the context of pod-level resources, so I'd rather wait on a solution.
I realized there's a problem with the proposal that the scheduler uses the max of the desired resources & actual resources: if the container is not running, the actual resources aren't reported. Note that this issue exists with the current implementation as well, since allocated resources are only included for running containers. One way around this is we could set status.containerStatuses[*].resources to the allocated resources even when the container isn't running.
+1 on setting status.containerStatuses[*].resources to the allocated resources even when the container isn't running:
I don't see any potential issues caused by this. @vinaykul?
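A minimal sketch of the scheduler-side accounting being discussed (an assumption-laden illustration, not the actual scheduler code): take the max of the desired (spec) request and the actual (status) resources per container, falling back to the spec request when status.Resources isn't populated, e.g. because the container isn't running.

```go
package sketch

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// containerRequestForScheduling is a hypothetical helper returning the quantity
// the scheduler would account for a single container and resource name.
func containerRequestForScheduling(c v1.Container, cs *v1.ContainerStatus, name v1.ResourceName) resource.Quantity {
	desired := c.Resources.Requests[name]
	if cs == nil || cs.Resources == nil || cs.Resources.Requests == nil {
		// No runtime-reported resources (e.g. container not running): use the spec request.
		return desired
	}
	// Use whichever is larger: the desired request or the reported actual request.
	if actual := cs.Resources.Requests[name]; actual.Cmp(desired) > 0 {
		return actual
	}
	return desired
}
```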
FTR: I am lgtm on this, even though there are some smaller details open.
/approve from node perspective
@tallclair once you've addressed our last comments, please ping us for lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: dchen1107, jpbetz, tallclair
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Added additional GA constraints, and relaxed ResizePolicy to be fully mutable. I think this addresses the remaining open comment threads; I hope I didn't miss anything!

/lgtm
One-line PR description: Update PRR beta requirements sections
Issue link: In-Place Update of Pod Resources #1287
Other comments