KEP: in-place update of pod resources #686

---
title: In-place Update of Pod Resources
authors:
  - "@kgolab"
  - "@bskiba"
  - "@schylek"
  - "@vinaykul"
owning-sig: sig-autoscaling
participating-sigs:
  - sig-node
  - sig-scheduling
reviewers:
  - "@bsalamat"
  - "@derekwaynecarr"
  - "@dchen1107"
approvers:
  - TBD
editor: TBD
creation-date: 2018-11-06
last-updated: 2018-11-06
status: provisional
see-also:
replaces:
superseded-by:
---

# In-place Update of Pod Resources

## Table of Contents

* [In-place Update of Pod Resources](#in-place-update-of-pod-resources)
* [Table of Contents](#table-of-contents)
* [Summary](#summary)
* [Motivation](#motivation)
* [Goals](#goals)
* [Non-Goals](#non-goals)
* [Proposal](#proposal)
* [API Changes](#api-changes)
* [Container Resize Policy](#container-resize-policy)
* [CRI Changes](#cri-changes)
* [Kubelet and API Server interaction](#kubelet-and-api-server-interaction)
* [Kubelet Restart Tolerance](#kubelet-restart-tolerance)
* [Scheduler and API Server interaction](#scheduler-and-api-server-interaction)
* [Flow Control](#flow-control)
* [Container resource limit update ordering](#container-resource-limit-update-ordering)
* [Notes](#notes)
* [Affected Components](#affected-components)
* [Future Enhancements](#future-enhancements)
* [Risks and Mitigations](#risks-and-mitigations)
* [Graduation Criteria](#graduation-criteria)
* [Implementation History](#implementation-history)

## Summary

This proposal aims to allow Pod resource requests & limits to be updated
in-place, without a need to restart the Pod or its Containers.

The **core idea** behind the proposal is to make PodSpec mutable with regard to
Resources, denoting **desired** resources. Additionally, PodSpec is extended to
reflect resources **allocated** to a Pod, and PodStatus is extended to provide
information about **actual** resources applied to the Pod and its Containers.

This document builds upon the [proposal for live and in-place vertical scaling][]
and [Vertical Resources Scaling in Kubernetes][].

[proposal for live and in-place vertical scaling]:
https://github.com/kubernetes/community/pull/1719
[Vertical Resources Scaling in Kubernetes]:
https://docs.google.com/document/d/18K-bl1EVsmJ04xeRq9o_vfY2GDgek6B6wmLjXw-kos4

## Motivation

Resources allocated to a Pod's Container(s) can require a change for various
reasons:
* load handled by the Pod has increased significantly, and current resources
  are not sufficient,
* load has decreased significantly, and allocated resources are unused,
* resources have simply been set improperly.

Currently, changing resource allocation requires the Pod to be recreated, since
the Resources field of a PodSpec's Containers is immutable.

While many stateless workloads are designed to withstand such a disruption,
some are more sensitive, especially when running a low number of Pod replicas.

Moreover, for stateful or batch workloads, a Pod restart is a serious
disruption, resulting in lower availability or a higher cost of running.

Allowing Resources to be changed without recreating the Pod or restarting the
Containers addresses this issue directly.

### Goals

* Primary: allow changing Pod resource requests & limits without restarting
  its Containers.
* Secondary: allow actors (users, VPA, StatefulSet, JobController) to decide
  how to proceed if an in-place resource resize is not possible.
* Secondary: allow users to specify which Pods and Containers can be resized
  without a restart.

### Non-Goals

An explicit non-goal of this KEP is controlling the full lifecycle of a Pod
whose in-place resource resizing failed. This is left to the actors which
initiated the resizing.

Other identified non-goals are:
* changing the Pod QoS class without a restart,
* changing resources of Init Containers without a restart,
* eviction of lower-priority Pods to facilitate a Pod resize,
* updating extended resources or any other resource types besides CPU and memory.

## Proposal

### API Changes

PodSpec becomes mutable with regard to Container resource requests and
limits. PodSpec is extended with information about the resources allocated on
the Node for the Pod. PodStatus is extended to show the actual resources
applied to the Pod and its Containers.

Thanks to the above:
* Pod.Spec.Containers[i].Resources becomes purely a declaration, denoting the
  **desired** state of Pod resources,
* Pod.Spec.Containers[i].ResourcesAllocated (new object, type v1.ResourceList)
  denotes the Node resources **allocated** to the Pod and its Containers,
* Pod.Status.ContainerStatuses[i].Resources (new object, type
  v1.ResourceRequirements) shows the **actual** resources held by the Pod and
  its Containers.

A new Pod subresource named 'resourceallocation' is introduced to allow
fine-grained access control that enables Kubelet to set or update resources
allocated to a Pod.
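
For illustration, the sketch below shows, in abbreviated Go, where the fields
described above would live. Only the field names ResourcesAllocated and the new
ContainerStatus Resources come from this proposal; the struct shapes are
simplified assumptions, not the actual core API types.

```go
package sketch

import v1 "k8s.io/api/core/v1"

// Abbreviated, illustrative shapes only: the real v1.Container and
// v1.ContainerStatus carry many more fields, and the new fields below do not
// exist in current releases.
type Container struct {
	Name               string
	Resources          v1.ResourceRequirements // desired resources; becomes mutable
	ResourcesAllocated v1.ResourceList         // new: Node resources allocated to the Container
}

type ContainerStatus struct {
	Name      string
	Resources *v1.ResourceRequirements // new: actual resources applied to the running Container
}
```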

#### Container Resize Policy

To provide fine-grained user control, PodSpec.Containers is extended with a
ResizePolicy map (new object) for each resource type (CPU, memory):
* NoRestart - the default value; resize the Container without restarting it,
* RestartContainer - restart the Container in-place to apply new resource
  values. (e.g. a Java process needs to change its Xmx flag)

By using ResizePolicy, a user can mark Containers as safe (or unsafe) for
in-place resource update. Kubelet uses it to determine the required action.

The flag controls CPU and memory separately due to the observation that CPU
can usually be added or removed without much problem, whereas changes to
available memory are more likely to require restarts.

If more than one resource type with different policies is updated, the
RestartContainer policy takes precedence over the NoRestart policy.

Additionally, if RestartPolicy is 'Never', ResizePolicy should be set to
NoRestart in order to pass validation.
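
A minimal sketch of the per-resource policy values and the precedence rule
described above; the policy value names follow this proposal, while the
requiredAction helper and its signature are illustrative assumptions:

```go
package sketch

import v1 "k8s.io/api/core/v1"

// Per-resource resize policy values described above.
type ResizePolicy string

const (
	NoRestart        ResizePolicy = "NoRestart"        // default: apply the new value in place
	RestartContainer ResizePolicy = "RestartContainer" // restart the Container to apply the new value
)

// requiredAction illustrates the precedence rule: if any changed resource
// carries the RestartContainer policy, the Container must be restarted.
func requiredAction(policy map[v1.ResourceName]ResizePolicy, changed []v1.ResourceName) ResizePolicy {
	for _, r := range changed {
		if policy[r] == RestartContainer {
			return RestartContainer
		}
	}
	return NoRestart
}
```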

#### CRI Changes

Kubelet calls the UpdateContainerResources CRI API, which currently takes a
*runtimeapi.LinuxContainerResources* parameter that works for Docker and Kata,
but not for Windows. This parameter changes to *runtimeapi.ContainerResources*,
which is runtime-agnostic and will contain platform-specific information.
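
As a rough illustration of the direction, the structs below sketch what a
runtime-agnostic parameter could look like. The exact message and field names
beyond *ContainerResources* are not specified here, so everything in this
sketch is an assumption:

```go
package sketch

// Assumed, abbreviated shapes of the platform-specific resource messages;
// the real CRI definitions carry more fields.
type LinuxContainerResources struct {
	CpuShares          int64
	CpuQuota           int64
	CpuPeriod          int64
	MemoryLimitInBytes int64
}

type WindowsContainerResources struct {
	CpuShares          int64
	CpuMaximum         int64
	MemoryLimitInBytes int64
}

// ContainerResources wraps the platform-specific variants so that
// UpdateContainerResources no longer takes a Linux-only parameter.
type ContainerResources struct {
	Linux   *LinuxContainerResources
	Windows *WindowsContainerResources
}
```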

### Kubelet and API Server Interaction

When a new Pod is created, Scheduler is responsible for selecting a suitable
Node that accommodates the Pod.

For a newly created Pod, Spec.Containers[i].ResourcesAllocated must match
Spec.Containers[i].Resources.Requests. When Kubelet admits a new Pod, values in
Spec.Containers[i].ResourcesAllocated are used to determine if there is enough
room to admit the Pod. Kubelet does not set Pod's ResourcesAllocated after
admitting a new Pod.

When a Pod resize is requested, Kubelet attempts to update the resources
allocated to the Pod and its Containers. Kubelet first checks if the new
desired resources fit the Node's allocatable resources by computing the sum of
resources allocated (Pod.Spec.Containers[i].ResourcesAllocated) for all Pods on
the Node, except the Pod being resized. For the Pod being resized, it adds the
new desired resources (i.e. Spec.Containers[i].Resources.Requests) to the sum
(a sketch of this check follows the list below).
* If the new desired resources fit, Kubelet accepts the resize by updating
  Pod.Spec.Containers[i].ResourcesAllocated via the pods/resourceallocation
  subresource, and then proceeds to invoke the UpdateContainerResources CRI API
  to update the Container resource limits. Once all Containers are successfully
  updated, it updates Pod.Status.ContainerStatuses[i].Resources to reflect the
  new resource values.
* If the new desired resources don't fit, Kubelet rejects the resize, and no
  further action is taken.
  - Kubelet retries the Pod resize at a later time.
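
A minimal sketch of this fit check, assuming the relevant ResourceLists are
handed to a helper directly (the ResourcesAllocated field does not exist in
current client libraries):

```go
package sketch

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// podFitsAfterResize sketches the check described above: sum the
// ResourcesAllocated of every other Pod on the Node, add the resized Pod's
// desired Requests, and compare the result against the Node's allocatable
// resources.
func podFitsAfterResize(allocatable v1.ResourceList, otherPodsAllocated []v1.ResourceList, desiredRequests v1.ResourceList) bool {
	for _, res := range []v1.ResourceName{v1.ResourceCPU, v1.ResourceMemory} {
		sum := resource.Quantity{}
		for _, alloc := range otherPodsAllocated {
			if q, ok := alloc[res]; ok {
				sum.Add(q)
			}
		}
		if q, ok := desiredRequests[res]; ok {
			sum.Add(q)
		}
		capacity := allocatable[res]
		if sum.Cmp(capacity) > 0 {
			return false // reject the resize; Kubelet retries later
		}
	}
	return true
}
```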

Scheduler may, in parallel, assign a new Pod to the Node because it uses cached
Pods to compute Node allocatable values. If this race condition occurs, Kubelet
resolves it by rejecting that new Pod if the Node has no room after the Pod
resize.

#### Kubelet Restart Tolerance

If Kubelet were to restart amidst handling a Pod resize, then upon restart, all
Pods are admitted at their current Pod.Spec.Containers[i].ResourcesAllocated
values, and resizes are handled after all existing Pods have been added. This
ensures that resizes don't affect previously admitted existing Pods.

### Scheduler and API Server Interaction

Scheduler continues to use a Pod's Spec.Containers[i].Resources.Requests for
scheduling new Pods, and continues to watch Pod updates and update its cache.
It uses the cached Pod's Spec.Containers[i].ResourcesAllocated values to
compute the Node resources allocated to Pods. This ensures that it always uses
the most recently available resource allocations when making new Pod scheduling
decisions.

### Flow Control

The following steps denote a typical flow of an in-place resize operation for a
Pod with ResizePolicy set to NoRestart for all its Containers.

1. Initiating actor updates Pod's Spec.Containers[i].Resources via PATCH verb.
1. API Server validates the new Resources. (e.g. Limits are not below
   Requests, QoS class doesn't change, ResourceQuota not exceeded...)
1. API Server calls all Admission Controllers to verify the Pod Update.
   * If any of the Controllers reject the update, API Server responds with an
     appropriate error message.
1. API Server updates the PodSpec object with the new desired Resources.
1. Kubelet observes that Pod's Spec.Containers[i].Resources.Requests and
   Spec.Containers[i].ResourcesAllocated differ. It checks its Node allocatable
   resources to determine if the new desired Resources fit the Node.
   * _Case 1_: Kubelet finds the new desired Resources fit. It accepts the
     resize and sets Spec.Containers[i].ResourcesAllocated equal to the values
     of Spec.Containers[i].Resources.Requests by invoking the resourceallocation
     subresource. It then applies the new cgroup limits to the Pod and its
     Containers, and once successfully done, sets the Pod's
     Status.ContainerStatuses[i].Resources to reflect the desired resources.
     - If at the same time a new Pod was assigned to this Node against the
       capacity taken up by this resource resize, that new Pod is rejected by
       Kubelet during admission if the Node has no more room.
   * _Case 2_: Kubelet finds that the new desired Resources do not fit.
     - If Kubelet determines there isn't enough room, it simply retries the Pod
       resize at a later time.
1. Scheduler uses the cached Pod's Spec.Containers[i].ResourcesAllocated to
   compute resources available on the Node while a Pod resize may be in
   progress.
   * If a new Pod is assigned to that Node in parallel, it can temporarily
     result in the actual sum of Pod resources for the Node exceeding the
     Node's allocatable resources. This is resolved when Kubelet rejects that
     new Pod during admission due to lack of room.
   * Once the Kubelet that accepted a parallel Pod resize updates that Pod's
     Spec.Containers[i].ResourcesAllocated, and the Scheduler subsequently
     updates its cache, accounting will reflect the updated Pod resources for
     future computations and scheduling decisions.
1. The initiating actor (e.g. VPA) observes the following (a sketch of this
   check follows the list):
   * _Case 1_: Pod's Spec.Containers[i].ResourcesAllocated values have changed
     and match Spec.Containers[i].Resources.Requests, signifying that the
     desired resize has been accepted and the Pod is being resized. The resize
     operation is complete when the Pod's Status.ContainerStatuses[i].Resources
     and Spec.Containers[i].Resources match.
   * _Case 2_: Pod's Spec.Containers[i].ResourcesAllocated remains unchanged,
     and continues to differ from the desired
     Spec.Containers[i].Resources.Requests. After a certain (user defined)
     timeout, the initiating actor may take alternate action. For example,
     based on Retry policy, the initiating actor may:
     - Evict the Pod to trigger a replacement Pod with new desired resources,
     - Do nothing and let Kubelet back off and later retry the in-place resize.
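
A minimal sketch of how an initiating actor could distinguish these cases,
assuming the observed values are passed in directly (the ResourcesAllocated
and ContainerStatus Resources fields do not exist in current client libraries):

```go
package sketch

import (
	v1 "k8s.io/api/core/v1"
	apiequality "k8s.io/apimachinery/pkg/api/equality"
)

// resizeState interprets the fields an initiating actor observes, per Case 1
// and Case 2 above: desired vs. allocated tells whether the resize has been
// accepted; actual vs. desired tells whether it has completed.
func resizeState(desired v1.ResourceRequirements, allocated v1.ResourceList, actual *v1.ResourceRequirements) string {
	switch {
	case !apiequality.Semantic.DeepEqual(desired.Requests, allocated):
		return "pending" // Case 2: Kubelet has not (yet) accepted the resize
	case actual == nil || !apiequality.Semantic.DeepEqual(*actual, desired):
		return "resizing" // Case 1: accepted; actual resources not yet at desired values
	default:
		return "complete"
	}
}
```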

#### Container resource limit update ordering

When an in-place resize is requested for multiple Containers in a Pod, Kubelet
updates the resource limits for the Pod and its Containers in the following
manner:
1. If resource resizing results in a net increase of a resource type (CPU or
   memory), Kubelet first updates the Pod-level cgroup limit for the resource
   type, and then updates the Container resource limit.
1. If resource resizing results in a net decrease of a resource type, Kubelet
   first updates the Container resource limit, and then updates the Pod-level
   cgroup limit.
1. If the resource update results in no net change of a resource type, only
   the Container resource limits are updated.

In all the above cases, Kubelet applies Container resource limit decreases
before applying limit increases.
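
A minimal sketch of this ordering for a single resource type; the
setPodCgroupLimit and setContainerLimits callbacks are hypothetical stand-ins
for the real pod-level cgroup update and per-Container CRI calls:

```go
package sketch

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// applyPodResize sketches the ordering above for one resource type. The
// setContainerLimits callback is assumed to apply per-Container decreases
// before increases, as described above.
func applyPodResize(res v1.ResourceName, oldPodLimit, newPodLimit resource.Quantity,
	setPodCgroupLimit func(v1.ResourceName, resource.Quantity) error,
	setContainerLimits func(v1.ResourceName) error) error {

	switch cmp := newPodLimit.Cmp(oldPodLimit); {
	case cmp > 0: // net increase: grow the Pod-level cgroup first, then the Containers
		if err := setPodCgroupLimit(res, newPodLimit); err != nil {
			return err
		}
		return setContainerLimits(res)
	case cmp < 0: // net decrease: shrink the Containers first, then the Pod-level cgroup
		if err := setContainerLimits(res); err != nil {
			return err
		}
		return setPodCgroupLimit(res, newPodLimit)
	default: // no net change: only the Container limits need updating
		return setContainerLimits(res)
	}
}
```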

#### Notes

* If the CPU Manager policy for a Node is set to 'static', then only integral
  values of CPU resize are allowed. If a non-integral CPU resize is requested
  for a Node with the 'static' CPU Manager policy, that resize is rejected and
  an error message is logged to the event stream.
* To avoid races and possible gamification, all components will use the Pod's
  Spec.Containers[i].ResourcesAllocated when computing resources used by Pods.
* If additional resize requests arrive while a Pod is being resized, those
  requests are handled after completion of the resize that is in progress, and
  the resize is driven towards the latest desired state.
* Lowering memory limits may not always take effect quickly if the application
  is holding on to pages. Kubelet will use a control loop (sketched after this
  list) to set the memory limits near usage in order to force a reclaim, and
  will update the Pod's Status.ContainerStatuses[i].Resources only when the
  limit is at the desired value.
* Impact of Pod Overhead: Kubelet adds the Pod Overhead to the resize request
  to determine if in-place resize is possible.
* Impact of memory-backed emptyDir volumes: If a memory-backed emptyDir is in
  use, Kubelet will clear out any files in emptyDir upon Container restart.
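
A minimal sketch of one iteration of such a control loop, assuming hypothetical
readUsage and setLimit helpers for the Container's memory cgroup:

```go
package sketch

// lowerMemoryLimit sketches one iteration of the control-loop idea above:
// step the cgroup memory limit down toward the desired value while keeping it
// just above current usage so the kernel reclaims pages.
func lowerMemoryLimit(desiredBytes int64, readUsage func() int64, setLimit func(int64) error) (bool, error) {
	const headroomBytes = 4 << 20 // small margin above current usage (an assumption)
	next := readUsage() + headroomBytes
	if next < desiredBytes {
		next = desiredBytes
	}
	if err := setLimit(next); err != nil {
		return false, err
	}
	// Only report done (and update Status.ContainerStatuses[i].Resources)
	// once the limit has actually reached the desired value.
	return next == desiredBytes, nil
}
```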

### Affected Components

Pod v1 core API:
* extended model,
* new subresource,
* added validation.

Admission Controllers: LimitRanger and ResourceQuota need to support Pod
updates:
* for ResourceQuota it should be enough to change the podEvaluator.Handler
  implementation to allow Pod updates,
* to ensure alignment with current ResourceQuota behavior, which blocks
  resources before they are used (e.g. for Pending Pods), we should do the
  following (sketched after this list):
  * for requests.[cpu|memory], the Pod's max(Spec.Containers[i].Resources.Requests,
    Spec.Containers[i].ResourcesAllocated) is used to compute the Pod aggregate,
  * for limits.[cpu|memory], the Pod's max(Spec.Containers[i].Resources.Limits,
    Status.ContainerStatuses[i].Resources.Limits) is used to compute the aggregate,
* for LimitRanger we check that a resize request does not violate the min and
  max limits specified in LimitRange for the Pod's namespace.
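
A minimal sketch of this aggregation rule for one resource type; the
quotaRequestCharge helper and its signature are assumptions for illustration:

```go
package sketch

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// quotaRequestCharge sketches the rule above for requests.[cpu|memory]:
// charge the larger of the desired Requests and the currently allocated
// value, so quota stays reserved before the resize takes effect.
func quotaRequestCharge(res v1.ResourceName, requests, allocated v1.ResourceList) resource.Quantity {
	r := requests[res]
	a := allocated[res]
	if a.Cmp(r) > 0 {
		return a
	}
	return r
}
```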

Kubelet:
* set Pod's Status.ContainerStatuses[i].Resources for Containers upon placing
  a new Pod on the Node,
* update Pod's Spec.Containers[i].ResourcesAllocated upon resize,
* change the UpdateContainerResources CRI API to work for both Linux & Windows.

Scheduler:
* compute resource allocations using Pod.Spec.Containers[i].ResourcesAllocated.

Controllers:
* propagate Template resource updates to running Pod instances.

Other components:
* check how the change of meaning of resource requests influences other
  Kubernetes components.

### Future Enhancements

1. Kubelet (or Scheduler) evicts lower priority Pods from Node to make room for
   resize. Pre-emption by Kubelet may be simpler and offer lower latencies.
1. Allow ResizePolicy to be set on Pod level, acting as default if (some of)
   the Containers do not have it set on their own.
1. Extend ResizePolicy to separately control resource increase and decrease
   (e.g. a Container can be given more memory in-place but decreasing memory
   requires Container restart).
1. Extend Node Information API to report the CPU Manager policy for the Node,
   and enable validation of integral CPU resize for nodes with 'static' CPU
   Manager policy.
1. Allow resizing local ephemeral storage.
1. Allow resource limits to be updated (VPA feature).

### Risks and Mitigations

1. Backward compatibility: When Pod.Spec.Containers[i].Resources becomes
   representative of desired state, and the Pod's true resource allocations are
   tracked in Pod.Spec.Containers[i].ResourcesAllocated, applications that
   query PodSpec and rely on Resources in PodSpec to determine resource
   allocations will see values that may not represent actual allocations. As
   a mitigation, this change needs to be documented and highlighted in the
   release notes, and in top-level Kubernetes documents.
1. Resizing memory lower: Lowering cgroup memory limits may not work as pages
   could be in use, and approaches such as setting the limit near current usage
   may be required. This issue needs further investigation.

## Graduation Criteria

TODO

## Implementation History

- 2018-11-06 - initial KEP draft created
- 2019-01-18 - implementation proposal extended
- 2019-03-07 - changes to flow control, updates per review feedback
- 2019-08-29 - updated design proposal

---

Review discussion on this KEP:

"Please identify the component owners (for the autoscaling/node/scheduling
areas) that will approve this KEP (and get approvals from them). That helps
ensure there's agreement on the goals and overall approach before entering the
API review process."

"@liggitt Thanks for pointing this out. I've identified the approvers for the
stakeholder SIGs, and SIG-node and SIG-scheduling have approved the KEP.
@mwielgus is going to follow up with @kgolab to see if there are any concerns,
and if not we should get lgtm and approval from SIG-autoscaling. Please let us
know what our next steps are for API review. Thanks,"

"Thanks. I'd suggest:"

"@liggitt Thanks for the guidance. I've resolved many of the issues and
comments that were either addressed or have become stale. I'm tracking the
remaining outstanding questions in #1287. I'll give folks a few days to re-open
any that they may feel is not resolved or resolved in error. And then I and
@dashpole will ping @thockin to set up a time for API review."