Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

In-Place Vertical Pod Scaling KEP to implementable, and mini-KEP for CRI extensions #1342

Merged
merged 16 commits into from
Jan 28, 2020

Conversation

vinaykul
Copy link
Member

One of the items/comments that is related to In-Place Pod Vertical Scaling KEP is to extend/update the CRI API to better support different Container runtimes such as Windows etc for resource update.

This mini KEP outlines the proposed changes to CRI API to accomplish that review item. This does not block implementation of Vertical Scaling KEP, but would be good to have in the time-frame of implementation of In-Place Pod Vertical Scaling feature.

CC: @PatrickLang @dashpole @derekwaynecarr @dchen1107 @yujuhong

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Oct 28, 2019
@k8s-ci-robot
Copy link
Contributor

Hi @vinaykul. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. labels Oct 28, 2019
@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Oct 28, 2019
@vinaykul
Copy link
Member Author

/assign @derekwaynecarr @mwielgus

@vinaykul
Copy link
Member Author

/label api-review

@k8s-ci-robot k8s-ci-robot added the api-review Categorizes an issue or PR as actively needing an API review. label Oct 29, 2019
@vinaykul vinaykul mentioned this pull request Oct 29, 2019
31 tasks
@dchen1107
Copy link
Member

cc/ @yliaog on Windows Containers

@dchen1107
Copy link
Member

cc/ @Random-Liu on CRI-containerd for Windows

@vinaykul
Copy link
Member Author

vinaykul commented Nov 9, 2019

@liggitt I'm planning to attend your Live API review session at the K8s contributor summit in San Diego.

If you have additional time, do you think we can review the primary KEP and perhaps this mini-KEP (if applicable) as part of your session, or some other time during KubeCon if you or another reviewer is available?

CC @dashpole - I hope you are coming there :)

@dashpole
Copy link
Contributor

dashpole commented Nov 9, 2019

@vinaykul I added it to my schedule.

Copy link
Member Author

@vinaykul vinaykul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated the graduation criteria per sig-node discussion.

@vinaykul
Copy link
Member Author

Here's the API change code preview - vinaykul/kubernetes#1 , specifically commit-id vinaykul/kubernetes@2a1aedd

@liggitt Please review the API change and the admission controller part. Is this what you had in mind?

CC: @dashpole @thockin @derekwaynecarr @dchen1107

@@ -309,7 +334,7 @@ before applying limit increases.

Pod v1 core API:
* extended model,
* new subresource,
* new admission controller,
* added validation.

Admission Controllers: LimitRanger, ResourceQuota need to support Pod Updates:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The flow described above requires kubelets to update pod spec, which they do not have permission to do today.

That involves changing:

  • the node authorizer to permit kubelets to patch/update the pods resource (not just the pods/status subresource)
  • the NodeRestriction admission plugin to understand what types of updates a kubelet is allowed to make to a pod (we would not want to allow arbitrary label/image updates, for example)

cc @tallclair

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point.

From what I can see, the simplest way is to introduce admitPodUpdate method to NodeRestriction plugin that verifies only ResourcesAllocated field is being touched, and that node updating the pod owns the pod. I'll try it out and see if that covers it without leaving any holes.

For authorization, I have modified NodeRules() in plugin/pkg/auth/authorizer/rbac/bootstrappolicy/policy.go to allow nodes to update the pod resource. (They are allowed to create and delete pods at this time).

The above approach is consistent with how pod creates and deletes by node are handled.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the proposal is scoped to support only cpu and memory resources, is kubelet only authorized to change those values? I am assuming that we would want the kubelet to report all resources allocated and enforced (not just cpu and memory), but we would not want to let a user change the pod spec in validation for anything other than cpu and memory? Is that an accurate understanding?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the alternative is that the pod admission plugin sets allocated for all resources other than cpu/memory, but that would make extending this support to other future resource types challenging.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kubelet can continue to report status on all resources. I don't see a need to restrict status changes for the new Resources field. However, for ResourcesAllocated field, it is best to start with allowing Node to change what's actually supported now. As we add support for other resource types, we can just add to the list of supported resource types in the admission plugin.

And yes, for the user, we definitely want to lock it down to just what's supported - cpu and memory

@@ -363,13 +388,131 @@ Other components:
could be in use, and approaches such as setting limit near current usage may
be required. This issue needs further investigation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since this proposed adding a new field to pod spec, we need to consider the following cases:

  • updates by clients unaware of the new field, which preserve it and send existing the existing value (e.g. dynamic clients using unstructured json requests/responses, or clients using patch)
    • since those clients would not currently be successfully changing resources, there's probably nothing special that needs to be done for these clients
  • updated by clients unaware of the new field, which drop it on update (e.g. old versions of client-go)
    • an update request from a client like this would set the new field to nil. The server must not treat that as an attempt by the client to clear the field (and forbid based on an authorization check, etc), but must maintain compatibility with existing clients by copying the value from the existing pod

Since the ResourcesAllocated field is in pod spec, and pod spec is also used inside pod templates, are we intending to allow/disallow this field to be set inside workload API types (e.g. daemonset, deployment)? Unless we actively prevent it, values can be set for that field in those types, and we have to think through how to handle updates from old clients for those types as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For Controllers: propagate Template resources update to running Pod instances, has that been investigated and proven feasible? There are multiple mechanisms controllers use to match up particular child resources with particular generations of the parent resource, and it would be good to know if some (like hashing of the pod template to determine a label for the child resource's selector) are incompatible with in-place update of the pod template without rolling out a new instance of the child.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Pardon my ignorance with admission controllers, I've just started playing with it a few weeks ago. But I believe I should be able to mutate it with the new PodResourceAllocation controller - I'll look deeper into this. Is there a wiki that I can use to experiment with upgrade?

About controllers, we had the propagation working with Job and Deployment controllers in our old design prototype code. But I'll remove this from the scope of the current KEP - VPA cares about updating running pods, and I don't want to commit to it as I need to budget for a few surprises as I do a thorough implementation of the admission control changes and handle upgrade scenario. So we will disallow updating template nested pods. This can always be added as a subsequent enhancement.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liggitt I dug a bit more into updating controller templates. Currently, we cannot update Resources field for Job controllers, but allowed to do so for Deployment controllers - it results in Pods being recreated with the new desired resources.

I want to keep the same behavior - if we attempted to disallow it because of this feature, it would be a breaking change.

In 1.19 or another future release, we can perhaps consider propagating the template resource change to running pods (as we had done in our old design PoC). So I'll clarify the KEP to state that current behavior will be maintained for template Pod Resources updates.

Copy link
Member

@liggitt liggitt Jan 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In 1.19 or another future release, we can perhaps consider propagating the template resource change to running pods (as we had done in our old design PoC). So I'll clarify the KEP to state that current behavior will be maintained for template Pod Resources updates.

If vertical scaling is only done on individual pod instances, that means a new rollout of a deployment will reset all resource use back to the original levels? Is that acceptable? That seems likely to cause problems if current pods were scaled up in response to load, then a rollout drops capacity back down significantly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or is the idea that a separate process would determine the average required resources and propagate that back into the workload template at some interval?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or is the idea that a separate process would determine the average required resources and propagate that back into the workload template at some interval?

Yes. Current VPA behavior is to make resource recommendations based on historical measurements and current usage, and optionally apply those recommendations during admission control if the user chooses to allow VPA to control the resources. New recommendations are currently applied by evicting the current pod so that it hits the admission controller.

At this time, we want to keep the current behavior aside from the added ability for VPA to request a pod to be resized without restart.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the pod admission mutation makes sense as long as that happens prior to quota evaluation.

btw, i appreciate this additional detail.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liggitt I'm able to take care of updates from older client-go versions by setting default values on create, and copying old object values on update by handling it in admission controller mutating phase rather than defaults.go. Doing this in defaults.go would attempt to set the values that were dropped by older client-go to default values and this we would lose data.

I was able to test this out by writing a little tool similar to staging/src/k8s.io/client-go/examples/create-update-delete-deployment, but one that calls Pods(ns).Update()

Validation allows Resources and ResourcesAllocated fields to be mutable only for PodSpec, and podresourceallocation and noderestriction plugins handle what user can do and what node can update.

Please review PR vinaykul/kubernetes#1

## Graduation Criteria

TODO
### Alpha
- In-Place Pod Resouces Update functionality is implemented,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for which controllers?

Copy link
Member Author

@vinaykul vinaykul Jan 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just the pod for now. I'll update the KEP and remove controller propagation from the scope.

keps/sig-node/20181106-in-place-update-of-pod-resources.md Outdated Show resolved Hide resolved

### Negative Tests
TBD

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given this touches a field involved in pod spec, pod template spec, and workload controllers, we need tests to make sure introduction of this does not cause workloads to redeploy on API server upgrade (e.g. kubernetes/kubernetes#78633); tests that look something like what is described in kubernetes/kubernetes#78904, and which are actually run

…Restriction extension to limit what Node can access in PodSpec
Copy link
Member

@derekwaynecarr derekwaynecarr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is getting really close. I would like to clarify if API server validation enforces that only cpu and memory are allowed to change in the pod spec.

@@ -309,7 +334,7 @@ before applying limit increases.

Pod v1 core API:
* extended model,
* new subresource,
* new admission controller,
* added validation.

Admission Controllers: LimitRanger, ResourceQuota need to support Pod Updates:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the proposal is scoped to support only cpu and memory resources, is kubelet only authorized to change those values? I am assuming that we would want the kubelet to report all resources allocated and enforced (not just cpu and memory), but we would not want to let a user change the pod spec in validation for anything other than cpu and memory? Is that an accurate understanding?

@@ -363,13 +388,131 @@ Other components:
could be in use, and approaches such as setting limit near current usage may
be required. This issue needs further investigation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the pod admission mutation makes sense as long as that happens prior to quota evaluation.

btw, i appreciate this additional detail.

@@ -309,7 +334,7 @@ before applying limit increases.

Pod v1 core API:
* extended model,
* new subresource,
* new admission controller,
* added validation.

Admission Controllers: LimitRanger, ResourceQuota need to support Pod Updates:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the alternative is that the pod admission plugin sets allocated for all resources other than cpu/memory, but that would make extending this support to other future resource types challenging.

@vinaykul
Copy link
Member Author

This is getting really close. I would like to clarify if API server validation enforces that only cpu and memory are allowed to change in the pod spec.

@derekwaynecarr Yes. I'll call this out explicitly in the KEP's affected components section and in the test plan. IIRC someone asked about resizing ephemeral storage, but I have scoped it out of this KEP and listed it as a potential future enhancement.

And same holds for Kubelet authorization as well. During Pod creation, we set default value of ResouresAllocated (SetDefaults_Pod function) equal to Resources.Requests if it is not set. And if it is set by user, we validate that it matches Resource.Requests. (At this time we don't support user requesting a resource allocation different from desired, but @dashpole had brought it up and we discussed it and left it as a possible future extension). Net result is that Node admits a pod at requested resources == resourcesAllocated or not at all (current pod admit behavior)

And yes, I do have our new plugin ordered before ResourceQuota plugin. I'll call it out explicitly in the KEP.

…ode's granular access details on access to PodSpec
@vinaykul
Copy link
Member Author

@liggitt @derekwaynecarr Please see if the last two commits resolve the concerns you have. Thanks,

@derekwaynecarr
Copy link
Member

@vinaykul thank you for the clarifications. this looks good to proceed on implementation.

will allow @liggitt to ack as well.

/lgtm

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jan 28, 2020
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 28, 2020
@vinaykul
Copy link
Member Author

@derekwaynecarr There was a silly error in the CRI KEP yaml formatting and I had to make a commit to fix that, and it removed the /lgtm label. Could you please lgtm it again? Thanks and sorry for the extra ask.

@liggitt Can you please review and let me know if your issues have been addressed? Thanks.

@vinaykul
Copy link
Member Author

/retest

@k8s-ci-robot
Copy link
Contributor

@vinaykul: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@derekwaynecarr
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 28, 2020
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: derekwaynecarr, thockin, vinaykul

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@haroon3rd
Copy link

Hi @vinaykul, what is the update on In-Place Vertical Pod Scaling?

@vinaykul
Copy link
Member Author

Hi @vinaykul, what is the update on In-Place Vertical Pod Scaling?

I'm waiting for @derekwaynecarr to review the changes I and @thockin worked out in PR #1883

Once Derek signs off, I plan to start implementation of the new API and design. I'll follow up with him in next week's meeting .. have been busy with other stuff for the past couple of weeks and didn't get to follow up with Derek. I still think we can make it for 1.22

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-review Categorizes an issue or PR as actively needing an API review. approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.