In-Place Vertical Pod Scaling KEP to implementable, and mini-KEP for CRI extensions #1342
Conversation
Hi @vinaykul. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/assign @derekwaynecarr @mwielgus
/label api-review
cc/ @yliaog on Windows Containers
cc/ @Random-Liu on CRI-containerd for Windows
keps/sig-autoscaling/20181106-in-place-update-of-pod-resources.md
keps/sig-node/20191025-kubelet-container-resources-cri-api-changes.md
@liggitt I'm planning to attend your Live API review session at the K8s contributor summit in San Diego. If you have additional time, do you think we could review the primary KEP, and perhaps this mini-KEP (if applicable), as part of your session or at some other time during KubeCon if you or another reviewer is available? CC @dashpole - I hope you are coming there :)
@vinaykul I added it to my schedule.
updated the graduation criteria per sig-node discussion.
Here's the API change code preview - vinaykul/kubernetes#1, specifically commit-id vinaykul/kubernetes@2a1aedd. @liggitt Please review the API change and the admission controller part. Is this what you had in mind?
@@ -309,7 +334,7 @@ before applying limit increases.

Pod v1 core API:
* extended model,
* new subresource,
* new admission controller,
* added validation.

Admission Controllers: LimitRanger, ResourceQuota need to support Pod Updates:
The flow described above requires kubelets to update pod spec, which they do not have permission to do today.
That involves changing:
- the node authorizer to permit kubelets to patch/update the pods resource (not just the pods/status subresource)
- the NodeRestriction admission plugin to understand what types of updates a kubelet is allowed to make to a pod (we would not want to allow arbitrary label/image updates, for example)
cc @tallclair
Good point.
From what I can see, the simplest way is to introduce an admitPodUpdate method in the NodeRestriction plugin that verifies that only the ResourcesAllocated field is being touched, and that the node updating the pod owns the pod. I'll try it out and see whether that covers it without leaving any holes.
For authorization, I have modified NodeRules() in plugin/pkg/auth/authorizer/rbac/bootstrappolicy/policy.go to allow nodes to update the pods resource. (They are already allowed to create and delete pods.)
The above approach is consistent with how pod creates and deletes by node are handled.
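For illustration, here is a minimal sketch of what such an admitPodUpdate check could look like. This is an assumption-laden sketch, not the final NodeRestriction code: ResourcesAllocated is the field this KEP proposes (it does not exist in the released API), and the package and function names are hypothetical.

```go
// Illustrative sketch of a NodeRestriction-style check: a node may update a
// pod only if the pod is bound to it, and only the proposed
// ResourcesAllocated field may differ between the old and new spec.
package noderestriction

import (
	"fmt"
	"reflect"

	api "k8s.io/kubernetes/pkg/apis/core"
)

func admitPodUpdate(nodeName string, oldPod, newPod *api.Pod) error {
	// Nodes may only touch pods scheduled onto themselves.
	if oldPod.Spec.NodeName != nodeName {
		return fmt.Errorf("node %q may only update pods bound to itself", nodeName)
	}
	// Blank out the one field a node is allowed to change, then require the
	// rest of the spec to be identical.
	oldSpec := oldPod.Spec.DeepCopy()
	newSpec := newPod.Spec.DeepCopy()
	for i := range oldSpec.Containers {
		oldSpec.Containers[i].ResourcesAllocated = nil // proposed field, not in the released API
	}
	for i := range newSpec.Containers {
		newSpec.Containers[i].ResourcesAllocated = nil
	}
	if !reflect.DeepEqual(oldSpec, newSpec) {
		return fmt.Errorf("node %q may only update container ResourcesAllocated", nodeName)
	}
	return nil
}
```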
Since the proposal is scoped to support only cpu and memory resources, is kubelet only authorized to change those values? I am assuming that we would want the kubelet to report all resources allocated and enforced (not just cpu and memory), but we would not want to let a user change the pod spec in validation for anything other than cpu and memory? Is that an accurate understanding?
the alternative is that the pod admission plugin sets allocated for all resources other than cpu/memory, but that would make extending this support to other future resource types challenging.
Kubelet can continue to report status on all resources. I don't see a need to restrict status changes for the new Resources field. However, for the ResourcesAllocated field, it is best to start by allowing the node to change only what's actually supported now. As we add support for other resource types, we can simply add to the list of supported resource types in the admission plugin.
And yes, for the user, we definitely want to lock it down to just what's supported - cpu and memory.
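As a rough illustration of that allowlist approach (names and placement are assumptions, and ResourcesAllocated is the proposed field), the check could look something like the sketch below, with the supported set growing as new resource types are added:

```go
// Illustrative only: restrict which resource types a node may change in the
// proposed ResourcesAllocated field.
package noderestriction

import (
	"fmt"

	api "k8s.io/kubernetes/pkg/apis/core"
)

// nodeMutableResources is the set of resource types currently supported for
// in-place resize; it starts with cpu and memory.
var nodeMutableResources = map[api.ResourceName]bool{
	api.ResourceCPU:    true,
	api.ResourceMemory: true,
}

// validateAllocatedChange rejects any added, changed, or removed entry whose
// resource type is not yet supported.
func validateAllocatedChange(oldList, newList api.ResourceList) error {
	for name, newQty := range newList {
		oldQty, existed := oldList[name]
		if (!existed || oldQty.Cmp(newQty) != 0) && !nodeMutableResources[name] {
			return fmt.Errorf("resource %q may not be changed in ResourcesAllocated", name)
		}
	}
	for name := range oldList {
		if _, still := newList[name]; !still && !nodeMutableResources[name] {
			return fmt.Errorf("resource %q may not be removed from ResourcesAllocated", name)
		}
	}
	return nil
}
```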
@@ -363,13 +388,131 @@ Other components:
could be in use, and approaches such as setting limit near current usage may
be required. This issue needs further investigation.
since this proposes adding a new field to pod spec, we need to consider the following cases:
- updates by clients unaware of the new field, which preserve it and send back the existing value (e.g. dynamic clients using unstructured json requests/responses, or clients using patch)
- since those clients would not currently be successfully changing resources, there's probably nothing special that needs to be done for these clients
- updates by clients unaware of the new field, which drop it on update (e.g. old versions of client-go)
- an update request from a client like this would set the new field to nil. The server must not treat that as an attempt by the client to clear the field (and forbid based on an authorization check, etc), but must maintain compatibility with existing clients by copying the value from the existing pod
Since the ResourcesAllocated field is in pod spec, and pod spec is also used inside pod templates, are we intending to allow/disallow this field to be set inside workload API types (e.g. daemonset, deployment)? Unless we actively prevent it, values can be set for that field in those types, and we have to think through how to handle updates from old clients for those types as well.
Regarding "For Controllers: propagate Template resources update to running Pod instances", has that been investigated and proven feasible? There are multiple mechanisms controllers use to match up particular child resources with particular generations of the parent resource, and it would be good to know if some (like hashing of the pod template to determine a label for the child resource's selector) are incompatible with in-place update of the pod template without rolling out a new instance of the child.
Yes. Pardon my ignorance with admission controllers; I've only started playing with them a few weeks ago. But I believe I should be able to mutate it with the new PodResourceAllocation controller - I'll look deeper into this. Is there a wiki I can use to experiment with upgrades?
About controllers, we had the propagation working with Job and Deployment controllers in our old design prototype code. But I'll remove this from the scope of the current KEP - VPA cares about updating running pods, and I don't want to commit to it, as I need to budget for a few surprises while I do a thorough implementation of the admission control changes and handle the upgrade scenario. So we will disallow updating template-nested pods. This can always be added as a subsequent enhancement.
@liggitt I dug a bit more into updating controller templates. Currently, we cannot update the Resources field for Job controllers, but we are allowed to do so for Deployment controllers - it results in Pods being recreated with the new desired resources.
I want to keep the same behavior - if we attempted to disallow it because of this feature, it would be a breaking change.
In 1.19 or another future release, we can perhaps consider propagating the template resource change to running pods (as we had done in our old design PoC). So I'll clarify the KEP to state that current behavior will be maintained for template Pod Resources updates.
In 1.19 or another future release, we can perhaps consider propagating the template resource change to running pods (as we had done in our old design PoC). So I'll clarify the KEP to state that current behavior will be maintained for template Pod Resources updates.
If vertical scaling is only done on individual pod instances, that means a new rollout of a deployment will reset all resource use back to the original levels? Is that acceptable? That seems likely to cause problems if current pods were scaled up in response to load, then a rollout drops capacity back down significantly.
Or is the idea that a separate process would determine the average required resources and propagate that back into the workload template at some interval?
Or is the idea that a separate process would determine the average required resources and propagate that back into the workload template at some interval?
Yes. Current VPA behavior is to make resource recommendations based on historical measurements and current usage, and optionally apply those recommendations during admission control if the user chooses to allow VPA to control the resources. New recommendations are currently applied by evicting the current pod so that it hits the admission controller.
At this time, we want to keep the current behavior aside from the added ability for VPA to request a pod to be resized without restart.
I think the pod admission mutation makes sense as long as that happens prior to quota evaluation.
btw, i appreciate this additional detail.
@liggitt I'm able to take care of updates from older client-go versions by setting default values on create, and by copying old object values on update, handled in the admission controller's mutating phase rather than in defaults.go. Doing this in defaults.go would set the values that were dropped by older client-go to default values, and thus we would lose data.
I was able to test this out by writing a little tool similar to staging/src/k8s.io/client-go/examples/create-update-delete-deployment, but one that calls Pods(ns).Update().
Validation allows the Resources and ResourcesAllocated fields to be mutable only for PodSpec, and the podresourceallocation and noderestriction plugins control what the user can do and what the node can update.
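A stripped-down sketch of that mutating step, assuming ResourcesAllocated is the proposed field and preserveResourcesAllocated is a hypothetical helper: if an update arrives with the field missing, as it would from an old client-go that dropped it, the stored value is carried forward rather than cleared.

```go
// Sketch of the mutating-admission compatibility handling; names are
// hypothetical and ResourcesAllocated is the field proposed by this KEP.
package podresourceallocation

import (
	api "k8s.io/kubernetes/pkg/apis/core"
)

// preserveResourcesAllocated keeps the stored ResourcesAllocated value when an
// updating client omitted it, instead of treating the omission as an attempt
// to clear the field.
func preserveResourcesAllocated(oldPod, newPod *api.Pod) {
	for i := range newPod.Spec.Containers {
		nc := &newPod.Spec.Containers[i]
		if nc.ResourcesAllocated != nil {
			continue // the client set the field explicitly; leave it alone
		}
		// Match containers by name and carry the old value forward.
		for j := range oldPod.Spec.Containers {
			oc := &oldPod.Spec.Containers[j]
			if oc.Name == nc.Name {
				nc.ResourcesAllocated = oc.ResourcesAllocated.DeepCopy()
				break
			}
		}
	}
}
```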
Please review PR vinaykul/kubernetes#1
## Graduation Criteria

TODO
### Alpha
- In-Place Pod Resources Update functionality is implemented,
for which controllers?
Just the pod for now. I'll update the KEP and remove controller propagation from the scope.
### Negative Tests
TBD
Given this touches a field involved in pod spec, pod template spec, and workload controllers, we need tests to make sure introduction of this does not cause workloads to redeploy on API server upgrade (e.g. kubernetes/kubernetes#78633); tests that look something like what is described in kubernetes/kubernetes#78904, and which are actually run
…Restriction extension to limit what Node can access in PodSpec
This is getting really close. I would like to clarify if API server validation enforces that only cpu and memory are allowed to change in the pod spec.
@derekwaynecarr Yes. I'll call this out explicitly in the KEP's affected components section and in the test plan. IIRC someone asked about resizing ephemeral storage, but I have scoped it out of this KEP and listed it as a potential future enhancement. The same holds for Kubelet authorization as well.

During Pod creation, we set the default value of ResourcesAllocated (in the SetDefaults_Pod function) equal to Resources.Requests if it is not set. And if it is set by the user, we validate that it matches Resources.Requests. (At this time we don't support the user requesting a resource allocation different from the desired resources, but @dashpole had brought it up; we discussed it and left it as a possible future extension.) The net result is that the Node admits a pod at requested resources == resourcesAllocated, or not at all (current pod admit behavior).

And yes, I do have our new plugin ordered before the ResourceQuota plugin. I'll call that out explicitly in the KEP.
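For illustration, the defaulting part could look roughly like the sketch below; it assumes the proposed ResourcesAllocated field on the container type and a hypothetical helper name, with the validation that a user-supplied value must equal Resources.Requests handled separately.

```go
// Sketch of the create-time defaulting: ResourcesAllocated is the proposed
// field and defaultResourcesAllocated is a hypothetical helper; the real
// change would sit alongside SetDefaults_Pod.
package defaults

import (
	corev1 "k8s.io/api/core/v1"
)

// defaultResourcesAllocated sets ResourcesAllocated equal to Resources.Requests
// when the user did not set it. Validation (not shown) rejects a user-supplied
// value that differs from Requests.
func defaultResourcesAllocated(pod *corev1.Pod) {
	for i := range pod.Spec.Containers {
		c := &pod.Spec.Containers[i]
		if c.ResourcesAllocated == nil && c.Resources.Requests != nil {
			c.ResourcesAllocated = make(corev1.ResourceList, len(c.Resources.Requests))
			for name, qty := range c.Resources.Requests {
				c.ResourcesAllocated[name] = qty.DeepCopy()
			}
		}
	}
}
```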
…ode's granular access details on access to PodSpec
@liggitt @derekwaynecarr Please see if the last two commits resolve the concerns you have. Thanks.
@derekwaynecarr There was a silly error in the CRI KEP YAML formatting and I had to make a commit to fix it, which removed the /lgtm label. Could you please lgtm it again? Thanks, and sorry for the extra ask. @liggitt Can you please review and let me know if your issues have been addressed? Thanks.
/retest
@vinaykul: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message. In response to this: /retest
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: derekwaynecarr, thockin, vinaykul
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Hi @vinaykul, what is the update on In-Place Vertical Pod Scaling?
I'm waiting for @derekwaynecarr to review the changes @thockin and I worked out in PR #1883. Once Derek signs off, I plan to start implementation of the new API and design. I'll follow up with him in next week's meeting; I have been busy with other stuff for the past couple of weeks and didn't get to follow up with Derek. I still think we can make it for 1.22.
One of the review items related to the In-Place Pod Vertical Scaling KEP is to extend/update the CRI API to better support resource updates across different container runtimes, such as Windows.
This mini-KEP outlines the proposed changes to the CRI API to address that review item. It does not block implementation of the Vertical Scaling KEP, but would be good to have within the time frame of implementing the In-Place Pod Vertical Scaling feature.
CC: @PatrickLang @dashpole @derekwaynecarr @dchen1107 @yujuhong