KEP-1287: InPlacePodVerticalScaling BETA update #4704
Conversation
- Impact of its outage on the feature:
- Impact of its degraded performance or high-error rates on the feature:

Compatible container runtime (see [CRI changes](#cri-changes)).
Perhaps also the addition of a RuntimeHandlerFeature for resize? As commented here.
That's more for the underlying OCI runtimes to report features they have available. We could maybe add features to the runtime status object, though I think the runtimes have been compatible for a while, unless the design is changing.
There's now RuntimeFeature, which we could use if we wanted to.
On further discussion, we've decided to push this back to v1.32. There are currently too many unresolved edge cases. I'll update this PR soon to highlight those unresolved conditions.
/milestone v1.32
Still WIP. I'm investigating, and plan to update / add sections on:

/assign @thockin

This is ready for review. I'll bring up some of these points for discussion in the next SIG-Node meeting.
* **Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?** Yes

- The feature should not be disabled on a running node (create a new node instead).
PRR reviewer here: I think I'm following the discussion but want to be sure... Can this be made to work without creating new nodes? If so I think we're all set from a PRR perspective. I don't think we can answer this with "no" and consider the feature production ready.
One minor change request, then LGTM for PRR.
/approve
@@ -51,6 +63,7 @@
- [Implementation History](#implementation-history)
One thing we should probably add to the PRR or elsewhere is the possible need to adjust API Priority and Fairness (APF) settings (from recent experience with misconfigured priority in prod 😮‍💨).
Why? I don't think IPPVS increases API calls, does it? Why is APF specifically relevant here?
I guess there are the resize calls themselves; is that the case you're thinking of?
Yeah.. I don't expect this to be a big issue. But if the kubelet is lumped together with other node-infra daemonsets in the same FlowSchema, then its impact would be amplified and hard to trace. It's more of a CYA move 😉
@@ -632,7 +653,7 @@ Pod Status in response to user changing the desired resources in Pod Spec.
  for a Node with 'static' CPU Manager policy, that resize is rejected, and
  an error message is logged to the event stream.
* To avoid races and possible gamification, all components will use Pod's
-  Status.ContainerStatuses[i].AllocatedResources when computing resources used
+  Status.ContainerStatuses[i].Resources when computing resources used
This could be a problem if the runtime does not support reporting status.Resources (e.g. users on containerd < 1.6.9), or if the runtime is reporting wrong resources.
If the runtime doesn't report the resources, we fall back to the allocated resources (e.g. memory requests). If the runtime reports the wrong resources, I don't think there's anything we can do about it, but I'm not sure we need to design for that?
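For illustration, here is a minimal sketch in Go of the fallback behavior described here, using the public `k8s.io/api/core/v1` types; the helper function and its `allocated` parameter are assumptions for this example, not the actual kubelet code:

```go
package sketch

import (
	v1 "k8s.io/api/core/v1"
)

// resourcesInUse is a hypothetical helper: it returns the requests a component
// should account for a container, preferring the runtime-reported
// status.Resources and falling back to the allocated requests when the runtime
// (e.g. containerd < 1.6.9) does not report them.
func resourcesInUse(cs v1.ContainerStatus, allocated v1.ResourceList) v1.ResourceList {
	if cs.Resources != nil && cs.Resources.Requests != nil {
		// The runtime reported actual resources; use them.
		return cs.Resources.Requests
	}
	// Fall back to the kubelet's allocated (admitted) requests.
	return allocated
}
```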
We have a fallback solution here, and agreed with @tallclair on not over-engineering for the case where the runtime reports the wrong resources. All we can do is perform some basic sanity checks on the reported values to see if they are within the allowable range based on node capacity and pod requests. I am OK with doing nothing for beta, at least. WDYT? @vinaykul @tallclair
This might be ok ... if you remember, we were initially setting status...Resources from AllocatedResources upon successful UpdateContainerResources,
before Lantao found a race in the implementation. We could continue to do that by tracking the last successful UpdateContainerResources values, as it brings us closer to the last known good configured values.
However, my concern is that by not reporting true values we might hide potential problems, throwing off o11y tools and masking alerts that should be firing off pagers. I don't know enough; let me check with our o11y folks and get back. Also, it is weird that a container which is not running says that it has configured resource requests & limits.
Good point! Let's ensure we do some basic sanity checks on the reported values and log something if a value is out of range.
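As a rough illustration of the kind of check being suggested (a sketch only; the function, its parameters, and the chosen bounds are assumptions, not an agreed implementation):

```go
package sketch

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	"k8s.io/klog/v2"
)

// sanityCheckReported logs a warning if a runtime-reported value falls outside
// a plausible range derived from the container's allocated request and the
// node's allocatable capacity. Purely illustrative; real bounds would need
// more care (e.g. values may legitimately lag during an in-progress resize).
func sanityCheckReported(container string, name v1.ResourceName, reported, allocated, nodeAllocatable resource.Quantity) {
	if reported.Cmp(allocated) < 0 || reported.Cmp(nodeAllocatable) > 0 {
		klog.Warningf("container %q: runtime-reported %s %s is outside the expected range [%s, %s]",
			container, name, reported.String(), allocated.String(), nodeAllocatable.String())
	}
}
```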
I had a good discussion with our o11y lead @vjsamuel a short while ago and gave him a quick overview of this feature and the proposed changes. He has some interesting insights on how we currently use Requests and how the suggested changes can affect handling in o11y, and can explain them much better than I can.. I know just enough about this to be dangerous. Thanks @vjsamuel for your help!
* Multiple containers

These resource requests & limits can have interdependencies that Kubernetes may not be aware of. For example, two containers coordinating work may need to be scaled in tandem. It probably doesn't make
Is there a clear, documented use case for "gang resize" or is this speculation?
I know I leveraged the notion of the atomic NodeResourcesFit predicate when computing resize fit to make the case for pod-level ResizeStatus vs. container-level ResizeStatus with @thockin. But we may be making things worse by building it into the design.
Consider this:
- We have a node (4 CPUs, 4Gi memory) with two pods: P1(req: {cpu: 2000m, mem: 3000Mi}) and P2(req: {cpu: 1500m, mem: 1000Mi}).
- VPA triggers an in-place pod resize for P1, updating it to P1(req: {cpu: 3000m, mem: 1000Mi}). But due to the atomic resize notion, this request remains in the Deferred state, keeping P1 at its original resource allocation levels for all types of resources.
- Some time later, a new pod P3(req: {mem: 2000Mi}) comes along. This pod P3 could have run on the node if the above resize request had been deconstructed into P1(req: {cpu: 2000m, mem: 1000Mi}) now plus P1(req: {cpu: 3000m, mem: 1000Mi}) later. P1's CPU increase is still Deferred, but P1 is now closer to its desired state.

In the above scenario, P3 either remains pending or CA fires up another node just to run P3 - either case increases cost to the customer due to resource waste via over-provisioning.
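To make the arithmetic in this scenario concrete, here is a back-of-the-envelope check of the numbers above (illustration only, not code from the KEP):

```go
package main

import "fmt"

func main() {
	const nodeCPU, nodeMem = 4000, 4096 // node allocatable: 4 CPUs (millicores), 4Gi (Mi)

	// Atomic resize: P1's request is Deferred, so P1 stays at 2000m/3000Mi; P2 is 1500m/1000Mi.
	fmt.Printf("atomic:  cpu %dm/%dm, mem %dMi/%dMi, free mem %dMi (< P3's 2000Mi, so P3 stays pending)\n",
		2000+1500, nodeCPU, 3000+1000, nodeMem, nodeMem-(3000+1000))

	// Deconstructed resize: P1's memory decrease (3000Mi -> 1000Mi) is admitted,
	// while the CPU increase (2000m -> 3000m) remains Deferred.
	fmt.Printf("partial: cpu %dm/%dm, mem %dMi/%dMi, free mem %dMi (P3's 2000Mi fits)\n",
		2000+1500, nodeCPU, 1000+1000, nodeMem, nodeMem-(1000+1000))
}
```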
I added clarification that this is about containers in the same pod, not resizing multiple pods together. There probably is a "gang resize" use case that goes along with gang scheduling, but that's out of scope for now.
WRT the scenario you listed here, I think scaling up CPU and scaling down memory at the same time is going to be extremely unusual. I would expect that if CPU is being scaled up, memory will be scaled up or stay the same. Maybe with VPA, if it's constantly scaling to match usage, this could happen? I'm not sure it's a use case we should optimize for, though.
In this case, I am also talking about resizing multiple containers in a single pod with something like the tandem resize constraint (to clarify my “gang resize” terminology above - and yes, let’s absolutely leave resizing multiple pods all-or-none ("syndicate resize"? 😓 ) out of scope .. non-goal). Also, this is not some corner-case thing; it is common for app owners to over-provision their requests, e.g. ask for way too much memory when the app is CPU-intensive. (We have an infra-team resource partly tasked with tuning these and negotiating with app teams to lower our DC hardware costs and power footprint.) Another (contrived) example is a pod’s memory request & limit increase being withheld despite plenty of memory being available, just because the CPU-increase part of the same resize request is Deferred (leading to OOM-kills).
If those resizes were to come in as multiple discrete requests and we can handle them with the same outcome, then we should be able to handle them coming in as a single request. A complex resize request is eventually broken down into simple UpdateContainerResources CRI calls to actuate it. I think the same can be done at admit, to get the pod as close to the desired state as possible. I feel the atomic-resize notion at admit can become a footgun that leaves clusters over-provisioned… which kinda defeats the purpose of this project.
Thinking out loud, what if we deconstructed a complex resize and admitted what we can (decrease requests will always be admitted), and set status to Deferred or Infeasible if any request cannot be admitted? I see now why @thockin's per-container ResizeStatus makes sense (at least internally, while keeping pod-level ResizeStatus for VPA to consume). Pod-level resize status will be Infeasible if one or more container requests are Infeasible, and Deferred if one or more requests are Deferred but all are feasible.
I need to think through this very carefully, but it’s been a long day and I’m beat rn… however, modeling it this way feels like it will work. WDYT?
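For what it's worth, a small sketch of the aggregation rule proposed in that comment (the type and names below are made up for illustration and are not the KEP's API): pod-level status is Infeasible if any container's resize is Infeasible, Deferred if any is Deferred and none are Infeasible, otherwise InProgress.

```go
package sketch

// ResizeStatus is a hypothetical per-container resize status for this example.
type ResizeStatus string

const (
	ResizeInProgress ResizeStatus = "InProgress"
	ResizeDeferred   ResizeStatus = "Deferred"
	ResizeInfeasible ResizeStatus = "Infeasible"
)

// aggregateResizeStatus derives a pod-level status from per-container statuses.
func aggregateResizeStatus(perContainer []ResizeStatus) ResizeStatus {
	agg := ResizeInProgress
	for _, s := range perContainer {
		switch s {
		case ResizeInfeasible:
			return ResizeInfeasible // any Infeasible container makes the pod resize Infeasible
		case ResizeDeferred:
			agg = ResizeDeferred // Deferred wins over InProgress, but not over Infeasible
		}
	}
	return agg
}
```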
I'm skeptical that trying to do partial resize is useful or correct, and it seems more likely to fail the "least surprise" test. Kube is level-actuated, kubelet should be driving at the most recent full update, I think.
IIUC, the level-actuated principle would apply if two resizes R1 (cpu: 1->2) and R2 (cpu: 2->3) both arrive before Kubelet has a chance to admit R1. In that case, we do R (cpu: 1->3), if it can be admitted. And all-or-none at pod atomicity makes sense when scheduling and admitting a new pod at Kubelet.
In this case, we have R1 (mem: 1->2), which is admissible, and R2 (cpu: 1->2), which is Deferred. If we get [R1, R2], we would admit R1 and defer R2. But if it came in as [R1+R2] or [R2, R1], then we would head-of-line block R1. The scheduler earmarks resources for both R1 & R2 anyway via max(Requests, AllocatedResources). So why withhold R1 by introducing artificial barriers that are not really necessary and waste those resources?
By admitting what we can when those resources are already reserved, we would bring the pod's actual state closer to its desired state than with pod-level atomic resize. We wouldn't hold up a container image update because a Deferred resize prevents a full update, would we?
+1 on @thockin's points above. I'd go with simplicity for now. Getting the in-place pod resizing feature to beta is a significant milestone for us. Focusing on the core functionality and addressing the most common use cases with the atomic approach first allows for a faster beta release. Once the feature is released as beta, we can collect feedback from users and evolve the feature toward partial resizes based on that feedback. @vinaykul WDYT?
I'm not in favor of this; it defeats the purpose of the project. Make it a GA blocker?
> defeats the purpose of the project
That's a huge statement. What we have here is a relatively minor case (and if not, something unforeseen is happening). IMO, having a clearly defined and safe semantic for such a corner case is more important than being exactly right in every case.
Thanks Vinay!
- Let's clearly document the limitations of the atomic approach, indicating that partial resizes are not yet supported and noting the potential for resource waste in certain scenarios. The plan is to call out and collect input from users of the upcoming beta feature.
- Let's also include re-evaluating this for GA. Based on the data and feedback we have, we can re-evaluate the need for partial resizes before GA. If the data shows significant resource waste or user frustration, prioritize implementing partial resizes as a prerequisite for GA.
+1 to Dawn's suggestions. I'll update the KEP to include this.
I think we also need to reevaluate this in the context of pod-level resources, so I'd rather wait on a solution.
I realized there's a problem with the proposal that the scheduler uses the max of the desired resources & actual resources: if the container is not running, the actual resources aren't reported. Note that this issue exists with the current implementation as well, since allocated resources are only included for running containers. One way around this is we could set status.containerStatuses[*].resources to the allocated resources even when the container isn't running.
+1 on setting status.containerStatuses[*].resources to the allocated resources even when the container isn't running:
I don't see any potential issues caused by this. @vinaykul?
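A minimal sketch of the scheduler-side accounting being discussed (an assumption-laden illustration, not the actual scheduler code): take the max of the desired (spec) request and the actual (status) resources per container, falling back to the spec request when status.Resources isn't populated, e.g. because the container isn't running.

```go
package sketch

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// containerRequestForScheduling is a hypothetical helper returning the quantity
// the scheduler would account for a single container and resource name.
func containerRequestForScheduling(c v1.Container, cs *v1.ContainerStatus, name v1.ResourceName) resource.Quantity {
	desired := c.Resources.Requests[name]
	if cs == nil || cs.Resources == nil || cs.Resources.Requests == nil {
		// No runtime-reported resources (e.g. container not running): use the spec request.
		return desired
	}
	// Use whichever is larger: the desired request or the reported actual request.
	if actual := cs.Resources.Requests[name]; actual.Cmp(desired) > 0 {
		return actual
	}
	return desired
}
```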
FTR: I am lgtm on this, even though there are some smaller details open.
/approve from node perspective
@tallclair once you've addressed our last comments, please ping us for lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: dchen1107, jpbetz, tallclair
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Added additional GA constraints, and relaxed ResizePolicy to be fully mutable. I think this addresses the remaining open comment threads; I hope I didn't miss anything!

/lgtm
One-line PR description: Update PRR beta requirements sections
Issue link: In-Place Update of Pod Resources #1287
Other comments