In-Place Vertical Pod Scaling KEP to implementable, and mini-KEP for CRI extensions #1342
@@ -5,9 +5,9 @@ authors:
- "@bskiba"
- "@schylek"
- "@vinaykul"
owning-sig: sig-autoscaling
owning-sig: sig-node
participating-sigs:
- sig-node
- sig-autoscaling
- sig-scheduling
reviewers:
- "@bsalamat"
@@ -23,9 +23,10 @@ approvers:
- "@mwielgus"
editor: TBD
creation-date: 2018-11-06
last-updated: 2018-11-06
status: provisional
last-updated: 2020-01-14
status: implementable
see-also:
- "/keps/sig-node/20191025-kubelet-container-resources-cri-api-changes.md"
replaces:
superseded-by:
---
@@ -48,11 +49,20 @@ superseded-by:
- [Scheduler and API Server Interaction](#scheduler-and-api-server-interaction)
- [Flow Control](#flow-control)
- [Container resource limit update ordering](#container-resource-limit-update-ordering)
- [Container resource limit update failure handling](#container-resource-limit-update-failure-handling)
- [Notes](#notes)
- [Affected Components](#affected-components)
- [Future Enhancements](#future-enhancements)
- [Risks and Mitigations](#risks-and-mitigations)
- [Test Plan](#test-plan)
- [Unit Tests](#unit-tests)
- [Pod Resize E2E Tests](#pod-resize-e2e-tests)
- [Resource Quota and Limit Ranges](#resource-quota-and-limit-ranges)
- [Resize Policy Tests](#resize-policy-tests)
- [Graduation Criteria](#graduation-criteria)
- [Alpha](#alpha)
- [Beta](#beta)
- [Stable](#stable)
- [Implementation History](#implementation-history)
<!-- /toc -->
@@ -134,15 +144,15 @@ Thanks to the above:
v1.ResourceRequirements) shows the **actual** resources held by the Pod and
its Containers.

A new Pod subresource named 'resourceallocation' is introduced to allow
fine-grained access control that enables Kubelet to set or update resources
allocated to a Pod, and prevents the user or any other component from changing
the allocated resources.
A new admission controller named 'PodResourceAllocation' is introduced in order
to limit access to the ResourcesAllocated field such that only Kubelet can
update this field.
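For illustration, the API shape this implies could look roughly like the sketch below. This is not the final API: the ResourcesAllocated field name comes from the KEP text, but the exact Go shape and JSON tags are assumptions.

```go
package sketch

import v1 "k8s.io/api/core/v1"

// Illustrative only: how the core/v1 Container type could be extended per this
// KEP. Field names follow the KEP text; the shape and tags are assumptions.
type containerSpec struct {
	// Desired resources, writable by the user as today.
	Resources v1.ResourceRequirements `json:"resources,omitempty"`

	// Resources actually granted by Kubelet. Only Kubelet may update this,
	// enforced by the proposed PodResourceAllocation admission controller
	// (an earlier draft used a pods/resourceallocation subresource) together
	// with NodeRestriction.
	ResourcesAllocated v1.ResourceList `json:"resourcesAllocated,omitempty"`
}
```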
#### Container Resize Policy

To provide fine-grained user control, PodSpec.Containers is extended with
ResizePolicy map (new object) for each resource type (CPU, memory):
ResizePolicy - a list of named subobjects (new object) that supports 'cpu'
and 'memory' as names. It supports the following policy values:
* NoRestart - the default value; resize Container without restarting it,
* RestartContainer - restart the Container in-place to apply new resource
  values. (e.g. Java process needs to change its Xmx flag)
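A sketch of the ResizePolicy object described above follows; the names come from the KEP text, but the exact Go shape is an assumption, not the final API.

```go
package sketch

// ResizePolicyValue is the action Kubelet takes when the named resource of a
// container is resized.
type ResizePolicyValue string

const (
	// NoRestart (default): apply the new values without restarting the container.
	NoRestart ResizePolicyValue = "NoRestart"
	// RestartContainer: restart the container in place so it picks up the new
	// values (e.g. a Java process that must re-read its Xmx flag).
	RestartContainer ResizePolicyValue = "RestartContainer"
)

// ResizePolicy associates a policy with a resource name ("cpu" or "memory").
type ResizePolicy struct {
	ResourceName string            `json:"resourceName"`
	Policy       ResizePolicyValue `json:"policy"`
}
```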
@@ -167,6 +177,13 @@ Kubelet calls UpdateContainerResources CRI API which currently takes
but not for Windows. This parameter changes to *runtimeapi.ContainerResources*,
which is runtime agnostic, and will contain platform-specific information.

Additionally, the ContainerStatus CRI API is extended to hold
*runtimeapi.ContainerResources* so that it allows Kubelet to query the
Container's CPU and memory limit configurations from the runtime.

These CRI changes are a separate effort that does not affect the design
proposed in this KEP.
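For illustration, the runtime-agnostic message could look roughly like the following Go rendering. The authoritative definition is the CRI protobuf in the mini-KEP referenced in see-also; the exact field set here is an assumption.

```go
package sketch

// LinuxContainerResources mirrors today's Linux-only CRI parameter.
type LinuxContainerResources struct {
	CPUPeriod          int64  // CPU CFS period (microseconds)
	CPUQuota           int64  // CPU CFS quota (microseconds)
	CPUShares          int64  // relative CPU weight
	MemoryLimitInBytes int64  // memory limit
	CpusetCpus         string // cpuset CPUs (static CPU manager policy)
	CpusetMems         string // cpuset memory nodes
}

// WindowsContainerResources carries the Windows-specific equivalents.
type WindowsContainerResources struct {
	CPUShares          int64
	CPUCount           int64
	CPUMaximum         int64
	MemoryLimitInBytes int64
}

// ContainerResources is platform-agnostic. It is passed to
// UpdateContainerResources and also returned in ContainerStatus, so Kubelet
// can query the limits currently configured in the runtime.
type ContainerResources struct {
	Linux   *LinuxContainerResources
	Windows *WindowsContainerResources
}
```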
### Kubelet and API Server Interaction

When a new Pod is created, Scheduler is responsible for selecting a suitable
@@ -185,11 +202,10 @@ resources allocated (Pod.Spec.Containers[i].ResourcesAllocated) for all Pods in
the Node, except the Pod being resized. For the Pod being resized, it adds the
new desired resources (i.e. Spec.Containers[i].Resources.Requests) to the sum.
* If new desired resources fit, Kubelet accepts the resize by updating
  Pod.Spec.Containers[i].ResourcesAllocated via pods/resourceallocation
  subresource, and then proceeds to invoke UpdateContainerResources CRI API
  to update the Container resource limits. Once all Containers are successfully
  updated, it updates Pod.Status.ContainerStatuses[i].Resources to reflect the
  new resource values.
  Pod.Spec.Containers[i].ResourcesAllocated, and then proceeds to invoke
  UpdateContainerResources CRI API to update Container resource limits. Once
  all Containers are successfully updated, it updates
  Pod.Status.ContainerStatuses[i].Resources to reflect new resource values.
* If new desired resources don't fit, Kubelet rejects the resize, and no
  further action is taken.
  - Kubelet retries the Pod resize at a later time.
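The fit check described above can be summarized with a small sketch. It uses simplified local types rather than the real API objects, since ResourcesAllocated is the field proposed by this KEP and does not exist in core/v1 yet; this is not Kubelet's actual code.

```go
package sketch

// containerAlloc carries the two resource views this KEP distinguishes.
type containerAlloc struct {
	AllocatedCPUMilli, AllocatedMemBytes int64 // ResourcesAllocated
	DesiredCPUMilli, DesiredMemBytes     int64 // Resources.Requests (desired)
}

type podAlloc struct {
	UID        string
	Containers []containerAlloc
}

// canAdmitResize sums ResourcesAllocated for every other pod on the node, adds
// the desired requests of the pod being resized, and compares the total
// against node allocatable.
func canAdmitResize(allocCPUMilli, allocMemBytes int64, podsOnNode []podAlloc, resized podAlloc) bool {
	var cpu, mem int64
	for _, p := range podsOnNode {
		if p.UID == resized.UID {
			continue // the pod being resized is counted at its new desired requests
		}
		for _, c := range p.Containers {
			cpu += c.AllocatedCPUMilli
			mem += c.AllocatedMemBytes
		}
	}
	for _, c := range resized.Containers {
		cpu += c.DesiredCPUMilli
		mem += c.DesiredMemBytes
	}
	return cpu <= allocCPUMilli && mem <= allocMemBytes
}
```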
@@ -234,10 +250,9 @@ Pod with ResizePolicy set to NoRestart for all its Containers.
resources to determine if the new desired Resources fit the Node.
* _Case 1_: Kubelet finds new desired Resources fit. It accepts the resize
  and sets Spec.Containers[i].ResourcesAllocated equal to the values of
  Spec.Containers[i].Resources.Requests by invoking resourceallocation
  subresource. It then applies the new cgroup limits to the Pod and its
  Containers, and once successfully done, sets Pod's
  Status.ContainerStatuses[i].Resources to reflect the desired resources.
  Spec.Containers[i].Resources.Requests. It then applies the new cgroup
  limits to the Pod and its Containers, and once successfully done, sets
  Pod's Status.ContainerStatuses[i].Resources to reflect desired resources.
  - If at the same time, a new Pod was assigned to this Node against the
    capacity taken up by this resource resize, that new Pod is rejected by
    Kubelet during admission if Node has no more room.
@@ -283,6 +298,16 @@ updates resource limit for the Pod and its Containers in the following manner:
In all the above cases, Kubelet applies Container resource limit decreases
before applying limit increases.

#### Container resource limit update failure handling

If multiple Containers in a Pod are being updated, and the UpdateContainerResources
CRI API fails for any of the containers, Kubelet backs off and retries at a
later time. Kubelet does not attempt to update limits for containers that are
lined up for update after the failing container. This ensures that the sum of the
container limits does not exceed the Pod-level cgroup limit at any point. Once all
the container limits have been successfully updated, Kubelet updates the Pod's
Status.ContainerStatuses[i].Resources to match the desired limit values.
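A minimal sketch of the ordering and failure-handling rules just described, assuming a hypothetical updateCRI callback standing in for the UpdateContainerResources CRI call; this is illustrative pseudocode, not Kubelet's implementation.

```go
package sketch

import "fmt"

type limitUpdate struct {
	Container string
	OldLimit  int64
	NewLimit  int64
}

// applyLimitUpdates applies all limit decreases before any increases, so the
// sum of container limits never exceeds the Pod-level cgroup limit. On any
// failure it stops: containers queued after the failing one are left
// untouched, and the whole resize is retried later with backoff.
func applyLimitUpdates(updates []limitUpdate, updateCRI func(limitUpdate) error) error {
	var ordered []limitUpdate
	for _, u := range updates {
		if u.NewLimit <= u.OldLimit {
			ordered = append(ordered, u) // decreases first
		}
	}
	for _, u := range updates {
		if u.NewLimit > u.OldLimit {
			ordered = append(ordered, u) // then increases
		}
	}
	for _, u := range ordered {
		if err := updateCRI(u); err != nil {
			return fmt.Errorf("resize of %s failed, will retry later: %w", u.Container, err)
		}
	}
	// Only after every container succeeded does Kubelet publish the new values
	// in Status.ContainerStatuses[i].Resources (not shown here).
	return nil
}
```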
#### Notes

* If CPU Manager policy for a Node is set to 'static', then only integral
@@ -309,7 +334,7 @@ before applying limit increases.

Pod v1 core API:
* extended model,
* new subresource,
* new admission controller,
* added validation.

Admission Controllers: LimitRanger, ResourceQuota need to support Pod Updates:
@@ -363,13 +388,131 @@ Other components:
could be in use, and approaches such as setting limit near current usage may
be required. This issue needs further investigation.
Review discussion:

Since this proposes adding a new field to pod spec, we need to consider the following cases: the ResourcesAllocated field is in pod spec, and pod spec is also used inside pod templates, so are we intending to allow/disallow this field to be set inside workload API types (e.g. daemonset, deployment)? Unless we actively prevent it, values can be set for that field in those types, and we have to think through how to handle updates from old clients for those types as well.

Yes. Pardon my ignorance with admission controllers, I've just started playing with them a few weeks ago. But I believe I should be able to mutate it with the new PodResourceAllocation controller - I'll look deeper into this. Is there a wiki that I can use to experiment with upgrade? About controllers, we had the propagation working with Job and Deployment controllers in our old design prototype code. But I'll remove this from the scope of the current KEP - VPA cares about updating running pods, and I don't want to commit to it as I need to budget for a few surprises as I do a thorough implementation of the admission control changes and handle the upgrade scenario. So we will disallow updating template-nested pods. This can always be added as a subsequent enhancement.

@liggitt I dug a bit more into updating controller templates. Currently, we cannot update the Resources field for Job controllers, but we are allowed to do so for Deployment controllers - it results in Pods being recreated with the new desired resources. I want to keep the same behavior - if we attempted to disallow it because of this feature, it would be a breaking change. In 1.19 or another future release, we can perhaps consider propagating the template resource change to running pods (as we had done in our old design PoC). So I'll clarify the KEP to state that current behavior will be maintained for template Pod Resources updates.

If vertical scaling is only done on individual pod instances, that means a new rollout of a deployment will reset all resource use back to the original levels? Is that acceptable? That seems likely to cause problems if current pods were scaled up in response to load, then a rollout drops capacity back down significantly. Or is the idea that a separate process would determine the average required resources and propagate that back into the workload template at some interval?

Yes. Current VPA behavior is to make resource recommendations based on historical measurements and current usage, and optionally apply those recommendations during admission control if the user chooses to allow VPA to control the resources. New recommendations are currently applied by evicting the current pod so that it hits the admission controller. At this time, we want to keep the current behavior aside from the added ability for VPA to request a pod to be resized without restart.

I think the pod admission mutation makes sense as long as that happens prior to quota evaluation. Btw, I appreciate this additional detail.

@liggitt I'm able to take care of updates from older client-go versions by setting default values on create, and copying old object values on update by handling it in the admission controller mutating phase rather than defaults.go. Doing this in defaults.go would attempt to set the values that were dropped by older client-go to default values, and thus we would lose data. I was able to test this out by writing a little tool similar to staging/src/k8s.io/client-go/examples/create-update-delete-deployment, but one that calls Pods(ns).Update(). Validation allows Resources and ResourcesAllocated fields to be mutable only for PodSpec, and the podresourceallocation and noderestriction plugins handle what the user can do and what the node can update. Please review PR vinaykul/kubernetes#1
## Test Plan

### Unit Tests

Unit tests will cover the sanity of the code changes that implement the feature,
and the policy controls that are introduced as part of this feature.

### Pod Resize E2E Tests

End-to-End tests resize a Pod via a PATCH to the Pod's Spec.Containers[i].Resources.
The e2e tests use docker as the container runtime.
- Resizing of Requests is verified by querying the values in the Pod's
  Spec.Containers[i].ResourcesAllocated field.
- Resizing of Limits is verified by querying the cgroup limits of the Pod's
  containers.
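For illustration, a resize request in such an E2E test could be issued with recent client-go roughly as follows; the pod name, namespace, container name, and resource values are made up for the example.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	// Strategic merge patch: request new CPU values for container "c1".
	patch := []byte(`{"spec":{"containers":[{"name":"c1",
		"resources":{"requests":{"cpu":"200m"},"limits":{"cpu":"400m"}}}]}}`)

	pod, err := cs.CoreV1().Pods("default").Patch(context.TODO(), "test-pod",
		types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		panic(err)
	}
	// Verification would then compare Spec.Containers[i].ResourcesAllocated
	// (for requests) and the container cgroup limits (for limits).
	fmt.Println("patched pod:", pod.Name)
}
```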
E2E test cases for Guaranteed class Pod with one container:
1. Increase, decrease Requests & Limits for CPU only.
1. Increase, decrease Requests & Limits for memory only.
1. Increase, decrease Requests & Limits for CPU and memory.
1. Increase CPU and decrease memory.
1. Decrease CPU and increase memory.

E2E test cases for Burstable class single container Pod that specifies
both CPU & memory:
1. Increase, decrease Requests - CPU only.
1. Increase, decrease Requests - memory only.
1. Increase, decrease Requests - both CPU & memory.
1. Increase, decrease Limits - CPU only.
1. Increase, decrease Limits - memory only.
1. Increase, decrease Limits - both CPU & memory.
1. Increase, decrease Requests & Limits - CPU only.
1. Increase, decrease Requests & Limits - memory only.
1. Increase, decrease Requests & Limits - both CPU and memory.
1. Increase CPU (Requests+Limits) & decrease memory (Requests+Limits).
1. Decrease CPU (Requests+Limits) & increase memory (Requests+Limits).
1. Increase CPU Requests while decreasing CPU Limits.
1. Decrease CPU Requests while increasing CPU Limits.
1. Increase memory Requests while decreasing memory Limits.
1. Decrease memory Requests while increasing memory Limits.
1. CPU: increase Requests, decrease Limits. Memory: increase Requests, decrease Limits.
1. CPU: decrease Requests, increase Limits. Memory: decrease Requests, increase Limits.
E2E tests for Burstable class single container Pod that specifies CPU only:
1. Increase, decrease CPU - Requests only.
1. Increase, decrease CPU - Limits only.
1. Increase, decrease CPU - both Requests & Limits.

E2E tests for Burstable class single container Pod that specifies memory only:
1. Increase, decrease memory - Requests only.
1. Increase, decrease memory - Limits only.
1. Increase, decrease memory - both Requests & Limits.

E2E tests for Guaranteed class Pod with three containers (c1, c2, c3):
1. Increase CPU & memory for all three containers.
1. Decrease CPU & memory for all three containers.
1. Increase CPU, decrease memory for all three containers.
1. Decrease CPU, increase memory for all three containers.
1. Increase CPU for c1, decrease c2, c3 unchanged - no net CPU change.
1. Increase memory for c1, decrease c2, c3 unchanged - no net memory change.
1. Increase CPU for c1, decrease c2 & c3 - net CPU decrease for Pod.
1. Increase memory for c1, decrease c2 & c3 - net memory decrease for Pod.
1. Increase CPU for c1 & c3, decrease c2 - net CPU increase for Pod.
1. Increase memory for c1 & c3, decrease c2 - net memory increase for Pod.
### Resource Quota and Limit Ranges

Set up a namespace with a ResourceQuota and a single, valid Pod.
1. Resize the Pod within resource quota - CPU only.
1. Resize the Pod within resource quota - memory only.
1. Resize the Pod within resource quota - both CPU and memory.
1. Resize the Pod to exceed resource quota - CPU only.
1. Resize the Pod to exceed resource quota - memory only.
1. Resize the Pod to exceed resource quota - both CPU and memory.

Set up a namespace with min and max LimitRange and create a single, valid Pod.
1. Increase, decrease CPU within min/max bounds.
1. Increase CPU to exceed max value.
1. Decrease CPU to go below min value.
1. Increase memory to exceed max value.
1. Decrease memory to go below min value.
### Resize Policy Tests

Set up a guaranteed class Pod with two containers (c1 & c2).
1. No resize policy specified; defaults to NoRestart. Verify that CPU and
   memory are resized without restarting containers.
1. NoRestart (cpu, memory) policy for c1, RestartContainer (cpu, memory) for c2.
   Verify that c1 is resized without restart, c2 is restarted on resize.
1. NoRestart cpu, RestartContainer memory policy for c1. Resize c1 CPU only,
   verify the container is resized without restart.
1. NoRestart cpu, RestartContainer memory policy for c1. Resize c1 memory only,
   verify the container is resized with restart.
1. NoRestart cpu, RestartContainer memory policy for c1. Resize c1 CPU & memory,
   verify the container is resized with restart.

### Negative Tests
TBD
Review comment: Given this touches a field involved in pod spec, pod template spec, and workload controllers, we need tests to make sure introduction of this does not cause workloads to redeploy on API server upgrade (e.g. kubernetes/kubernetes#78633); tests that look something like what is described in kubernetes/kubernetes#78904, and which are actually run.
## Graduation Criteria

TODO
### Alpha
- In-Place Pod Resources Update functionality is implemented,

  Review comment: for which controllers? Response: Just the pod for now. I'll update the KEP and remove controller propagation from the scope.
- LimitRanger and ResourceQuota handling are added,
- Resize Policies functionality is implemented,
- Unit tests and E2E tests covering basic functionality are added,
- E2E tests covering multiple containers are added.

### Beta
- VPA alpha integration of feature completed and any bugs addressed,
- E2E tests covering Resize Policy, LimitRanger, and ResourceQuota are added,
- Negative tests are identified and added.

### Stable
- VPA integration of feature moved to beta,
- User feedback (ideally from at least two distinct users) is green,
- No major bugs reported for three months.

## Implementation History

- 2018-11-06 - initial KEP draft created
- 2019-01-18 - implementation proposal extended
- 2019-03-07 - changes to flow control, updates per review feedback
- 2019-08-29 - updated design proposal
- 2019-10-25 - update key open items and move KEP to implementable
- 2020-01-06 - API review suggested changes incorporated
- 2020-01-13 - Test plan and graduation criteria added
- 2020-01-21 - Graduation criteria updated per review feedback
Review discussion on node permissions:

The flow described above requires kubelets to update pod spec, which they do not have permission to do today. That involves changing authorization so nodes can update the pods resource (not just the pods/status subresource). cc @tallclair

Good point. From what I can see, the simplest way is to introduce an admitPodUpdate method to the NodeRestriction plugin that verifies only the ResourcesAllocated field is being touched, and that the node updating the pod owns the pod. I'll try it out and see if that covers it without leaving any holes. For authorization, I have modified NodeRules() in plugin/pkg/auth/authorizer/rbac/bootstrappolicy/policy.go to allow nodes to update the pod resource. (They are allowed to create and delete pods at this time.) The above approach is consistent with how pod creates and deletes by node are handled.
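To make the idea concrete, a rough sketch of such a check follows. This is not the actual NodeRestriction code; admitPodUpdate and the cleared field are illustrative, since ResourcesAllocated does not exist in core/v1 yet and Resources.Requests stands in for it here.

```go
package sketch

import (
	"fmt"
	"reflect"

	v1 "k8s.io/api/core/v1"
)

// admitPodUpdate allows a node to update a pod only if the pod is bound to
// that node and nothing other than the allowed resource values changes.
func admitPodUpdate(nodeName string, oldPod, newPod *v1.Pod) error {
	if oldPod.Spec.NodeName != nodeName {
		return fmt.Errorf("node %q may only update pods bound to it", nodeName)
	}
	// Neutralize the only field the node is allowed to change, then require
	// everything else in the spec to be identical.
	oldCopy, newCopy := oldPod.DeepCopy(), newPod.DeepCopy()
	clearAllocated(oldCopy)
	clearAllocated(newCopy)
	if !reflect.DeepEqual(oldCopy.Spec, newCopy.Spec) {
		return fmt.Errorf("node may only update allocated resources")
	}
	return nil
}

func clearAllocated(p *v1.Pod) {
	// In the real implementation this would clear ResourcesAllocated on every
	// container; Resources.Requests is used as a stand-in in this sketch.
	for i := range p.Spec.Containers {
		p.Spec.Containers[i].Resources.Requests = nil
	}
}
```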
Since the proposal is scoped to support only cpu and memory resources, is kubelet only authorized to change those values? I am assuming that we would want the kubelet to report all resources allocated and enforced (not just cpu and memory), but we would not want to let a user change the pod spec in validation for anything other than cpu and memory? Is that an accurate understanding?

The alternative is that the pod admission plugin sets allocated for all resources other than cpu/memory, but that would make extending this support to other future resource types challenging.

Kubelet can continue to report status on all resources. I don't see a need to restrict status changes for the new Resources field. However, for the ResourcesAllocated field, it is best to start with allowing the Node to change what's actually supported now. As we add support for other resource types, we can just add to the list of supported resource types in the admission plugin.

And yes, for the user, we definitely want to lock it down to just what's supported - cpu and memory.